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Editor's  introduction 


For  years  many  teachers  of  economics  and  other  professional 
economists  have  felt  the  need  of  a  series  of  books  on  economic  subjects 
which  is  not  filled  by  the  usual  textbook,  nor  by  the  highly  technical 
treatise. 

This  present  series,  published  under  the  general  title  of  Economics 
Handbook  Series,  was  planned  with  these  needs  in  mind.  Designed 
first  of  all  for  students,  the  volumes  are  useful  in  the  ever-growing  field 
of  adult  education  and  also  are  of  interest  to  the  informed  general 
reader. 

The  volumes  present  a  distillate  of  accepted  theory  and  practice, 
without  the  detailed  approach  of  the  technical  treatise.  Each  volume 
is  a  unit,  standing  on  its  own. 

The  authors  are  scholars,  each  writing  on  an  economic  subject  on 

C  which  he  is  an  authority.     In  this  series  the  author's  first  task  was  not 

,  2  to  make  important  contributions  to  knowledge— although  many  of 

^  them  do — but  so  to  present  his  subject  matter  that  his  work  as  a 

„v«~J  scholar  will  carry  its  maximum  influence  outside  as  well  as  inside  the 

cv  classroom.     The  time  has  come  to  redress  the  balance  between  the 

energies  spent  on  the  creation  of  new  ideas  and  on  their  dissemination. 

0  Economic  ideas  are  unproductive  if  they  do  not  spread  beyond  the 

world    of    scholars.     Popularizers    without    technical    competence, 

v    unqualified  textbook  writers,  and  sometimes  even  charlatans  control 

i   too  laige  a  part  of  the  market  for  economic  ideas. 

In  the  classroom  the  Economics  Handbook  Series  will  seive,  it  is 
hoped,  as  brief  surveys  in  one-semester  courses,  as  supplementary 
reading  in  introductory  courses,  and  in  other  courses  in  which  the 
subject  is  related. 

Seymour  E.  Harris 
>  J    ,  v 
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Editor's  preface 


The  editor  welcomes  Stefan  Valavanis'  study  of  econometrics  into 
the  Economics  Handbook  Series  as  a  unique  contribution  to  eco- 
nometrics and  to  the  teaching  of  the  subject. 

Anyone  who  reads  this  book  will  understand  the  tragedy  of  the 
death  of  Stefan  Valavanis.  He  was  brilliant,  imaginative,  and  a  first- 
class  scholar  and  teacher,  and  his  death  is  a  great  loss  to  the  world  of 
ideas. 

Professor  Valavanis  had  virtually  completed  his  book  just  befora 
his  departure  for  Europe  in  the  summer  of  1958.  But,  as  is  alwaya 
true  of  a  manuscript  left  with  the  publisher,  though  it  was  essentially 
complete  much  remained  to  be  done.  My  colleague,  Professor  Alfred  H, 
Conrad,  volunteered  to  finish  the  job.  Unselfishly  he  put  the  final 
touches  on  the  book,  went  over  the  manuscript,  checked  the  math- 
ematics, assumed  the  responsibility  for  seeing  it  through  the  press, 
and  helped  in  many  other  ways.  Without  his  help,  the  problem  of 
publication  would  have  been  a  serious  one.  The  publisher  and  editor 
are  indeed  grateful. 

This  book  is  an  introduction  to  econometrics,  that  is,  to  the  tech- 
niques by  which  economic  theories  are  brought  into  contact  with  the 
facts.  While  not  in  any  sense  a  "cookbook,"  its  orientation  is 
constantly  toward  the  strategy  of  economic  research.  Within  the 
field  of  econometrics,  the  book  is  primarily  addressed  to  the  problems 
of  estimation  rather  than  to  the  testing  of  hypotheses.  It  is  concerned 
with  estimating,  from  the  insufficient  information  available,  the  values 
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or  magnitudes  of  the  variables  and  relationships  suggested  by  economic 
analysis.  The  maximum  likelihood  and  limited  information  tech- 
niques are  developed  from  fundamental  assumptions  and  criteria  and 
demonstrated  by  example;  their  costs  in  accuracy  and  computation 
are  weighed.  There  are  short  but  careful  treatments  of  identification, 
instrumental  variables,  factor  analysis,  and  hypothesis  testing.  The 
book  proceeds  much  more  by  statements  of  problems  and  examples  than 
by  the  development  of  mathematical  proofs. 

The  main  feature  of  this  book  is  its  pedagogical  strength.  While 
rigor  is  not  sacrificed  and  no  mathematical  or  statistical  rabbits  are 
pulled  out  of  the  author's  hat,  the  statistical  tools  are  always  presented 
in  terms  of  the  fundamental  limitations  and  criteria  of  the  real  world. 
Almost  every  concept  is  introduced  by  an  example  set  in  this  world  of 
real  problems  and  difficulties.  Mathematical  concepts  and  notational 
distinctions  are  most  often  handled  in  clearly  set  off  "digressions." 
The  fundamental  notions  of  probability  and  matrix  algebra  are 
reviewed,  but  in  general  it  is  assumed  that  the  student  has 
already  been  introduced  to  determinants  and  matrices  and  the 
elementary  properties  and  processes  of  differentiation.  (No  more 
knowledge  of  mathematics  is  required  than  for  any  of  the  other 
comparable  texts,  and,  thanks  to  the  pedagogical  skills  of  the  author, 
probably  considerably  less.)  Frequent  emphasis  is  placed  upon  com- 
putation design  and  requirements. 

Valavanis'  book  is  brilliantly  organized  for  classroom  presentation, 
most  of  the  statistical  and  mathematical  assumptions  and  concepts 
being  treated  verbally  and  by  example  before  they  appear  in  any 
mathematical  formulation.  In  addition  to  the  examples  used  in 
presentation,  there  are  exercises  in  almost  every  chapter. 

Seymour  E.  Harris 
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This  work  is  neither  a  complete  nor  a  systematic  treatment  of 
econometrics.  It  definitely  is  not  empirical.  It  has  one  unifying  idea: 
to  reduce  to  common-sense  terms  the  mathematical  statistics  on  which 
the  theory  of  econometrics  rests. 

If  anything  in  econometrics  (or  in  any  other  field)  makes  sensa,  one 
ought  to  be  able  to  put  it  into  words.  The  result  may  not  ba  so  com- 
pact as  a  close-knit  mathematical  exposition,  but  it  can  be,  In  its 
own  way,  just  as  elegant  and  clear. 

Putting  symbols  and  jargon  into  words  understandable  to  a  wider 
audience  is  not  the  only  thing  I  want  to  do.  I  think  that  watering 
down  a  highly  refined  or  a  very  deep  mathematical  argument  is  a 
useful  activity.  For  instance,  if  the  essence  of  a  problem  can  be 
captured  by  two  variables,  why  tackle  n?  Or  why  worry  about 
mathematical  continuity,  existence,  and  singularity  in  a  discussion  of 
economic  matters,  unless  these  intriguing  properties  have  interesting 
economic  counterparts?  We  would  be  misspending  effort  if  all  the 
reader  wants  is  an  intelligent  layman's  idea  of  what  is  going  on  in  the 
field  of  econometrics.  For  the  sake  of  the  punctilious,  I  shall  give 
warning  every  time  my  heuristic  "proof"  is  not  watertight  or  when- 
ever I  slur  over  an  unessential  mathematical  maze. 

Much  of  econometric  literature  suffers  from  overfancy  notation.  If 
I  judge  rightly,  many  people  quake  at  the  sight  of  every  new  issue  of 
Econometrica.  I  hope  to  show  them  the  intuitive  good  sense  that  hides 
behind  the  mathematical  squiggles. 
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Besides  restoring  the  self-assurance  of  the  ordinary  intelligent 
reader  and  helping  him  discriminate  between  really  important  develop- 
ments in  econometric  method  and  mere  mathematical  quibbles,  I  have 
tried  to  be  useful  to  the  teachers  of  econometrics  and  peripheral 
subjects  by  supplying  them  with  material  in  "pedagogic"  form.  And 
lastly,  I  should  like  to  amuse  and  surprise  the  serious  or  expert  eco- 
nometrician,  the  connoisseur,  by  serving  him  familiar  hash  in  nice  new 
palatable  ways  but  without  loss  of  nutrient  substance. 

The  gaps  in  this  work  are  intentional.  One  cannot,  from  this  book 
alone,  learn  econometrics  from  the  ground  up;  one  must  pick  up 
elementary  statistical  notions,  algebra,  a  little  calculus — even  some 
econometrics — elsewhere. 

For  the  beginner  in  econometrics,  an  approximately  correct  sequence 
would  be  the  books  of  Beach  (1957),  Tinbergen  (1951),  Klein  (1953), 
and  Hood  (1953);  with  Tintner  (1942)  as  a  source  of  examples  or 
museum  for  the  numerous  varieties  of  quantitative  techniques  in 
existence.  Tinbergen  emphasizes  economic  policy;  Klein,  the  busi- 
ness cycle  and  macro-economics;  Tintner,  the  testing  of  hypotheses 
and  the  analysis  of  time  series.  All  three  use  interesting  empirical 
examples.  For  elementary  mathematics  the  first  part  of  Beach 
(perhaps  also  Klein,  appendix  on  matrices)  is  enough.  Reference  to 
all  these  easily  available  and  digestible  texts  is  meant  to  avoid  my 
repeating  what  has  been  said  by  others. 

From  time  to  time,  however,  I  make  certain  "digressions";  these 
are  held  in  from  the  margins.  These  digressions  have  to  do  mostly 
with  mathematical  and  statistical  subjects  that  in  my  opinion  are 
either  inaccessible  or  not  well  explained  elsewhere. 

Stefan  Valavanis 
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CHAPTER  1 


The  fundamental  proposition 
of  econometrics 


1.1.  What  econometrics  is  about 

An  econometrician's  job  is  to  express  economic  theories  in  mathe- 
matical  terms  in  order  to  verify  them  by  statistical  methods,  and  to 
measure  the  impact  of  one  economic  variable  on  another  so  as  to  be 
able  to  predict  future  events  or  advise  what  economic  policy  should  bs 
followed  when  such  and  such  a  result  is  desired. 

This  definition  describes  the  major  divisions  of  econometrics,  namely, 
specification,  estimation,  verification,  and  prediction. 

Specification  has  to  do  with  expressing  an  economic  theory  in  mathe- 
matical terms.  This  activity  is  also  called  model  building.  A  model 
is  a  set  of  mathematical  relations  (usually  equations)  expressing  an 
economic  theory.  Successful  model  building  requires  an  artist's  touch, 
a  sense  of  what  to  leave  out  if  the  set  is  to  be  kept  manageable,  elegmit, 
and  useful  with  the  raw  materials  (collected  data)  that  are  available. 
This  book  deals  only  incidentally  with  the  "specification"  aspect  of 
econometrics. 

The  problem  of  estimation  is  to  use  shrewdly  our  all  too  scanty  data, 
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so  as  to  fill  the  formal  equations  that  make  up  the  model  with  numerical 
values  that  are  good  and  trustworthy.  Suppose  we  have  the  following 
simple  theory  to  quantify:  Consumption  today  (C)  depends  on  yester- 
day's income  (Z)  in  such  a  way  that  equal  increments  of  income,  no 
matter  what  income  level  you  start  from,  always  bring  equal  increments 
in  consumption.  Letting  a  stand  for  consumption  at  zero  income  and 
7  for  the_ marginal^ pjopensjty_tQ„iiQnsume,  this  theory  can  be  expressed,, 
thus: 

Ct  =  a  +  yZt  (1-1) 

The  problem  of  estimation  (and  the  main  concern  of  this  book)  is  to 
discover  how  to  use  whatever  experience  we  have  about  consumption 
C  and  income  Z  in  order  to  make  a  shrewd  guess  about  how  large  a 
and  7  mi^.ht  really  be.  The  problem  of  estimation  is  to  guess  correctly 
a  and  7,  t  he  'parameters  (or  inherent  characteristics)  of  the  consumption 

junction. 

J&rinLe&timaU&n  is  making  the  best  possible  single  guess  about  a  and 
about  7.  Interval  estimation  is  guessing  how  far  ojir^uessjo^^may  be 
from  the  true  ^J^nd  our  guess  of  7  from  the  true_x* 

It  is  not  enough,  of  course,  to  be  able  to  make  correct  point  and 
interval  estimates.  We  want  to  make  them  as  cheaply  as  possible. 
Wg.HSJl^^^^^^Lfej^cfe.Wt.  programming  of  computations ,  checks  of 
accuracy,  and  shozLSUlS-  Though  this  aspect  of  estimation  will  not 
occupy  us  very  much,  I  shall  give  some  computational  advice  from 
time  to  time. 

Verification  sets  up  criteria  of  success  and  uses  these  criteria  to 
accept  or  reject  the  economic  theory  we  are  testing  with  the  model  and 
ourvdata.  It  is  a  tricky  subject  deeply  rooted  in  the  mathematical 
theory  of  statistics. 

Prediction  involves  rearranging  the  model  into  convenient  shape,  so 
that  we  can  feed  it  information  about  new  developments  in  exogenous 
and  lagged  variables  and  grind  out  answers  about  the  impact  of  these 
variables  on  the  endogenous  variables. 

1.2.  Mathematical  tools 

In  explaining  how  to  fashion  good  estimates  ft»r  the  parameters  of  an 
econometric  model  I  shall  often  step  into  th$  mathematical  statis- 
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tician's  toolroom  to  bring  out  one  gadget  or  another  required  by  the 
next  step  of  our  procedure.  These  digressions  are  clearly  marked  m 
they  can  be  skipped  by  those  acquainted  with  the  tool  in  question. 

The  mathematical  tools  used  again  and  again  are  elementary: 
analytic  geometry,  which  makes  equations  and  graphs  interchangeably; 
probability,  a  concept  enabling  us  to  make  precise  statements ..about 
uncertain  events;  the  derivative  (or  the  operation  of  differentiating), 
which  is  a  help  in  making  a  "best"  guess  among  all  possible  guessej; 
moments,  which  are  a  sophisticated  way  of  averaging  various  magni- 
tudes; and  matrices,  which  are  nothing  but  many-dimensional  ordinary 
numbers — indeed,  statements  that  are  true  of  ordinary  numbers 
seldom  fail  for  matrices — for  instance,  you  can  add,  subtract,  multiply, 
and  divide  matrices  analogously  to  numbers  and,  in  general,  handle 
them  as  if  they  were  ordinary  numbers  though  perhaps  more  fragile;  a 
vector  is  a  special  kind  of  matrix. 

1.3.  Outline  of  procedure  and  main  discoveries  in  the 
next  hundred  pages 

I.  We  shall  deal  first  with  models  consisting  of  a  single  equation. 
We  shall  find  that  even  in  this  simple  case  there  are  important 
difficulties. 

A.  It  is  not  always  possible  to  estimate  the  parameters  of  even  a 
single-equation  model,  for  two  sorts  of  reason: 

1.  f  e  may  lack  enough data.     This  is  called  the  'problem  of 
degrees  of  freedom. 

2.  Though  the  data  are  plentiful,  they  may  not  be  rich  or  varied 
enough.     This  is  the  problem  of '  multicollinegjtifa. 

B.  Our  second  important  finding  will  be  that  "pedestrian" 
methods  of  estimation,  for  example  the  least  squares  fit,  are 
apt  to  be  treacherous.  They  either  give  us  erroneous  impres- 
sions about  the  true  values  of  the  parameters  or  waste  the  data. 

II.  Turning  then  to  models  containing  two  or  more  equations,  our 

main  findings  will  be  the  following: 

A.  It  is  sometimes  impossible  to  determine  the  value  of  each 
parameter  in  each  equation,  but  this  time  not  merely  for  lack 
of  data  or  their  monotony,  but  rather  because  the  equations 
look  too  much  like  one  another  to  be  disentangled.     Econ- 
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ometricians  call  this  undesirable  property  lack  of  identifiability. 

B.  Nonpedestrian,  statistically  sophisticated  methods  become  very 
complex  and  costly  to  compute  when  the  model  increases  from 
a  single  equation  even  to  two. 

C.  Happily,  however,  by  sacrificing  some  of  the  rigor  of  these  ideal 
"  equestrian "  methods  in  special,  shrewd  ways,  we  can  cut 
the  burden  of  computation  by  a  factor  of  5  or  10  and  still  get 
pretty  good  results.  Such  techniques  are  called  limited 
i^ormation^Jzchmj&uej,  because  they  deliberately  disregard 
refinements  that  should  ideally  be  taken  into  account.  Most 
theoretical  econometricians  work  in  this  field,  because  the  need 
is  very  great  to  know  not  only  how  to  boil  down  complexity 
with  clever  tricks  but  also  precisely  how  much  each  trick  costs 
us  in  accuracy. 

1*4.  All-importance  of  statistical  assumptions 

The  key  word  in  estimation  is  the  word  stgdwslic.     ItSwQPJiQSJteJis^ 
exact  or  systematic. 

Stochastic  comes  from  the  Greek  stokhos  (a  target,  or  bull's-eye). 
Tjie  outcome  of  throwing  darts  is  a  stochastic  process,  that  is  to  say, 
fraught  with  occasional  misses.  In  economics,  indeed  in  all  empirical 
disciplines,  we  do  not  expect  our  predictions  to  hit  the  bull's-eye 
100  per  cent  of  the  time. 

Econometrics  begins  by  saying  a  great  deal  more  about  this  matter 
of  missing  the  mark.  Where  ordinary  economic  theory  merely  recog- 
nizes that  we  miss  the  mark  now  and  then,  econometrics  makes 
statistical  assumptions.  These  are  precise  statements  about  the  par- 
ticular way  the  darts  hit  the  target's  rim  or  hit  the  wall.  Everything — 
estimation,  prediction,  and  verification — depends  vitally  on  the  content 
of  the  statistical  assumptions.  Econometric  models  emphasize  this 
TacTby  using  a  special  variable jw  called  the  error  term.  The  error 
term  varies  from  instance  to  instance,  just  as  one  dart  falls  above, 
another  below,  one  to  the  left,  another  to  the  right,  of  the  target.  A 
subscript  t  serves  to  indicate  the  various  values  of  the  error  term.  To 
make  model  (1-1)  stochastic,  we  write 

Ct  =  a  +  yZt  +  ut  (1-2) 
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Before  going  on  to  rationalize  the  presence  of  the  error  term  u  m 
equation  (1-2),  two  things  must  be  explained.  First,  u  could  have  been 
included  as  a  multiplicative  factor  or  as  an  exponential  rather  than  as 

an  additive  term.    Second^  its  subscript  t  need  not  express  time. It 

can  refer  just  as  well  to  various  countries  or  income  classes. 

To  facilitate  the  exposition,  I  shall  henceforth  treat  ut  as  an  additive 
term  and  take  /  to  represent  time.  Exceptions  will  be  clearly  labeled. 
The  common-sense  interpretation  of  additivity  is  deferred  to  Sec.  1.11. 

1.5.  Rationalization  of  the  error  term 

There  are  four  types  of  reasons  why  an  econometric  model  should  be 
stochastic  and  not  exact:  incomplete  theory,  imperfect  specification^ 
aggregation  of  data,  and  errors  of  measurement.  Not  all  of  them 
apply  to  every  model. 

1.  Incomplete  theory 

A  theory  is  necessarily  incomplete,  an  abstraction  that  cannot 
explain  everything.     For  instance  Jaimoimple  theory  of  consumption*, 

a.  We  have  left  out  possible  variables,  like  wealth  and  liquid  assets^ 
that  also  affect  consumption. 

b.  We  have  left  out  equations.  The  economy  is  much  more  complex 
than  a  single  equation,  no  matter  how  many  explanatory  variables  this 
single  equation  may  contain;  there  may  be  other  links  between  con- 
sumption and  income  besides  the  consumption  function.1 

c.  Human  behavior  is  "ultimately"  random. 

2.  Imperfect  specification 

We  have  linearized  a  possibly  nonlinear  relationship. 

3.  Aggregation  of  data 

We  have  aggregated  over  dissimilar  individuals.    Even  if  each  of 

them  possessed  his  own  a  and  y  and  if  his  consumption  reacted  in 

exact  (nonstochastic)  fashion  to  his  past  income,  total  consumption 

would  not  be  likely  to  react  exactly  in  response  to  a  given  total  income, 

because  its  distribution  may  change.     Another  way  of  putting  this  is: 

1  How  many  independent  links  there  may  be  and  how  we  are  to  find  them  is 
itself  a  problem  in  statistical  inference,  and  is  treated  briefly  in  Chap.  10. 
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Variables  expressing  individual  peculiarities  are  missing  (cf.  la).  Or 
this  way:  Equations  that  describe  income  distribution  are  missing 
(cf.  16). 

4.  Errors  of  measurement 

Even  if  behavior  were  exact,  survey  methods  are  not,  and  our 
statistical  series  for  consumption  and  income  contain  some  errors  of 
measurement.  Throughout  this  book  we  pretend  that  all  variables  are 
measured  without  error. 

1.6.  The  fundamental  proposition 

All  we  get  out  of  an  econometric  model  is  already  implied  in: 

1.  Its  specification;  that  is  to  say,  consumption  C  depends  on  yester- 
day's income  Z  as  in  the  equation  C%  =  a  +  yZt  +  ut  and  in  no  other 
way1 

2.  Our  assumptions  concerning  u,  that  is  to  say,  the  particular  way 
we  suppose  the  relationship  between  C  and  Z  to  be  inexact2 

3.  Our  sampling  procedure,  namely,  the  way  we  arrange  to  get  data 

4.  Our  sample,  i.e.,  the  particular  data  that  happen  to  turn  up  after 
we  decide  how  to  look  for  them 

5.  Our  estimating  criterion,  i.e.,  what  properties  we  desire  our 
estimates  of  a  and  7  to  have,  short  of  the  unattainable:  absolute 
correctness 

Over  items  1,  2,  3,  and  5  we  have  absolute  control;  for  we  are  free  to 
change  our  theory  of  consumption,  our  set  of  assumptions  concerning 
the  error  term,  our  data-collecting  techniques,  and  our  estimating 
criterion.  We  have  no  control  over  item  4;  for  what  data  actually 
turn  up  is  a  matter  of  luck. 

According  to  this  fundamental  proposition,  what  estimates  we  get 

for  the  parameters  (a  and  7)  depends,  among  other  things,  on  the 

stochastic  assumptions,  i.e.,  what  we  choose  to  suppose  about  the 

behavior   of   the   error  term  u.     Every  set  of  assumptions  about 

the  error  term  prescribes  a  certain  way  of  guessing  at  the  true  value  of 

the  parameters.     And  conversely,  every  guess  about  the  parameters  is 

Implicitly  a  set  of  stochastic  assumptions. 

!  This  is  also  called  "structural  specification." 
5  This  is  also  called  "stochastic  specification." 
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The  relationship  between  stochastic  assumptions  and  parameter 
estimates  is  not  always  a  one-to-one  relationship.  A  given  set  of 
stochastic  assumptions  is  compatible  with  several  sets  of  different 
parameter  estimates,  and  conversely.  In  practice,  we  don't  have  to 
worry  about  these  possibilities,  because  we  shall  be  making  assump- 
tions about  u  that  lead  to  unique  guesses  about  a  and  7,  or,  at  the 
very  worst,  to  a  few  different  guesses.  Also,  in  practice,  since  we 
usually  are  interested  in  a  and  7,  not  in  verifying  our  assumptions 
about  u,  it  does  not  matter  that  many  different  u  assumptions  are 
compatible  with  a  single  set  of  parameter  estimates. 

1,7.  Population  and  sample 

The  whole  of  statistics  rests  on  the  distinction  between  population 
and  observation  or  sample.  People  were  receiving  income  and  con- 
suming it  long  before  econometrics  and  statistics  were  dreamed  of. 
There  is,  so  to  speak,  an  underlying  population  of  C's  and  Z's,  which 
we  can  enumerate,  hypothetically,  as  follows: 


C\,  C2,  . 

.  .  ,  Cp,  . 

.  ,cP 

or 

Cpforp  -  1,  2,  .  .  .  ,P 

Zi,  Z2,  . 

•  >  Zpi  ' 

. .  ,ZP 

or 

Zp  for  p  -  1,  2,  .  .  .  ,  P 

Of  these  C's  and  Z'b  we  may  have  observed  all  or  some.  Those  that 
we  have  observed  take  on  a  different  index  s  instead  of  p,  to  emphasize 
that  they  form  a  subset  of  the  population.  All  we  observe,  then, 
is  C,  and  Z„  where  the  s  assumes  some  (perhaps  all)  of  the  values  that 
p  runs,  but  no  more.  Index  s  can  start  running  anywhere,  say,  at 
p  =  5,  assume  the  values  6  and  7,  skip  8,  25,  and  92,  and  stop  anywhere 
short  of  the  value  P  or  at  P  itself.  In  all  cases  that  I  shall  discuss, 
the  sample  covers  consecutive  time  periods,  which  are  renumbered,  for 
convenience,  in  such  a  way  that  the  beginning  of  time  coincides  with 
the  beginning  of  the  sample,  not  of  the  population.  Whether  the 
sample  is  consecutive  or  not  sometimes  does  and  sometimes  does  not 
affect  the  estimation  of  a  and  7. 

Note  that  we  mean  by  the  term  sample  a  given  collection  of  observe* 
tions,  like  S  —  (C$>,CioAi;  Z9)Zio,Zu),  not  an  isolated  observation. 
S  is  a  sample  of  three  observations,  the  following  ones:  {C^Z^)}  (Cio,^ie), 
and  (Cn,Zii).  Samples  made  up  of  a  single  observation  can  exist,  of 
course,  but  we  seldom  work  with  so  little  observation. 
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1.8.  Parameters  and  estimates 

Another  crucial  distinction  is  between  a  parameter  and  its  estimate. 
If  the  theory  is  correct,  there  are,  hiding  somewhere,  the  true  a  and  7. 
These  we  never  observe.  What  we  do  do  is  guess  at  them,  basing 
ourselves  on  such  evidence  and  common  sense  as  we  may  have.  The 
guesses,  or  estimates,  always  wear  a  badge  to  distinguish  them  from 
the  parameters  themselves. 

CONVENTION 

We  shall  use  three  kinds  of  badge  for  a  parameter  estimate:  a  roof- 
shaped  hat,  as  in  4,  i,  to  mark  maximum  likelihood  estimates;  a  bird, 
the  same  symbol  upside  down,  as  in  a,  7,  for  naive  least  squares  esti- 
mates; and  the  wiggle,  as  in  a,  7,  for  other  kinds  of  estimates  or  for 
estimates  whose  kind  we  do  not  wish  to  specify.  These  types  of 
estimate  are  defined  in  Chap.  2. 

The  distinction  between  error  and  residual  is  analogous  to  the 
distinction  between  parameter  and  estimate.  The  error  ut  is  never 
observed,  although  we  may  speculate  about  its  behavior.  It  always 
goes  with  the  real  a  and  7,  as  in  (1-2),  whereas  the  residual,  which  is  an 
estimate  of  the  error  and  whose  symbol  always  wears  a.  distinctive 
badge,  can  be  calculated,  provided  we  have  settled  on  a  particular 
guess  (<2,f )  or  (a, 7)  for  the  parameters.  The  value  of  the  error  does 
not  depend  on  our  guessing;  it  is  just  there,  in  the  population  and, 
therefor©,  in  the  sample.  The  residual,  however,  depends  on  the 
particular  guess.  To  emphasize  this  fact  we  put  the  same  badge  on 
the  residual  as  on  the  corresponding  parameter  estimate.  We  write, 
for  example,  Ct  =  &  +  ^Zt  +  Ut  or  Ct  =  a  -J-  yZt  +  Ut. 

Now  we  can  state  precisely  what  the  problem  of  estimation  is,  as 
follows.  We  assume  a  theory,  for  example,  the  theory  of  consumption 
C$  =-  a  4-  7^1  +  ut;  we  assume  that  Ut  behaves  in  some  particular 
way  (to  which  the  next  section  is  devoted) ;  we  get  a  set  of  observations 
on  0  and  Z  (the  sample).  Then  we  manipulate  the  sample  data  to 
give  us  estimates  a  and  7  that  satisfy  our  estimating  criterion  (dis- 
cussed in  Chap.  2).  Then  we  compute  if  we  wish  the  residuals  Ut  as 
Climates  of  the  errors  ut. 
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1.9.  Assumptions  about  the  error  term 

Besides  additivity  (Sec.  1.4)  we  shall  now  make  and  interpret  m% 
assumptions  about  the  error  term.  Of  these,  the  first  is  indispensable, 
exactly  as  it  stands.  The  other  five  could  be  different  in  content  or  in 
number.    Note  carefully  that  these  are  statements  about  the  u's  not 

the  #'s. 

Assumption  1.    u  is  a  random  real  variable. 

If  the  model  is  stochastic,  either  its  systematic  variables  (eonsump* 
tion  or  income)  are  measured  with  errors,  or  the  consumption  function 
itself  is  subject  to  random  disturbances,  or  both.  Since  we  have  ruled 
out  (Sec.  1.5)  errors  of  measurement,  the  relationship  itself  km  to  \m 
stochastic. 

" Random"  is  one  of  those  words  whose  meaning  everybody  knows 
but  few  can  define.  Unpardonably,  few  standard  texts  glv©  its 
definition.  A  variable  is  random  if  it  takes  on  a  number  of  different 
values,  each  with  a  certain  probability.  Its  different  values  can  be 
infinite  in  number  and  can  range  all  over  the  field,  provided  th@r<§  are 
at  least  two  of  them.  For  instance,  a  variable  w  that  is  equal  to 
—  y%  twenty-five  per  cent  of  the  time,  to  3  +  \/2  forty  per  cent  of  the 
time,  and  to  +35.3  thirty-five  per  cent  of  the  time  is  a  random  variable. 
We  may  or  may  not  know  what  values  it  takes  on  or  their  probabili- 
ties. Its  probability  distribution  may  or  may  not  be  expressible 
analytically.     (See  Sec.  2.2.) 

Digression  on  the  distinction  between  probability 
and  probability  density 

A  random  variable  w  can  be  discrete^  like  the  number  of  remain- 
ing teeth  in  members  of  a  group  of  people,  or  continuoxm,  Ukf  their 
weight.  If  w  takes  on  a  finite  number  of  values,  their  probabili- 
ties can  be  quite  simply  plotted  on  and  read  off  a  dot  di®%?&m  or 
point  graph  (Fig.  la). 

With  a  continuous  variable  we  can  usually  speak  only  of  the 
probability  that  its  value  should  lie  between  (or  at)  such  and 
such  limits.     In  this  case,  we  plot  a  probability-density  graph 
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(Fig.  16).  The  height  of  such  a  graph  at  a  point  is  the  probability 
density }  and  the  relative  area  under  the  graph  between  any  two 
points  of  the  w  axis  is  the  probability. 

Assumption  2.     ut,  for  every  t,  has  zero  expected  value. 

Naively  interpreted,  this  proposition  says  that  the  "average"  value 
of  Wi  is  zero,  that  of  u2  is  also  zero,  and  so  forth.  Or,  to  put  it  dif- 
ferently, it  says  that  a  prediction  like  C\  =  a  +  yZi  is  "on  the 
average"  correct,  that  the  same  is  true  of  C2  =  a  +  7^2,  and  so  forth. 
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Fsg.  1.  A  random  variable,  a.  Variable  w  is  discrete*  The  illustration  is  a  dot 
diagram,  or  point  graph,  b.  Variable  w  is  continuous.  The  illustration  is  a 
probability-density  graph. 


The  trouble  begins  when  you  begin  to  wonder  what  "on  the  average" 
can  possibly  mean  if  you  stick  to  a  single  instance,  like  time  period  1 
(or  time  period  2).  For  every  event  happens  in  a  particular  way  and 
not  otherwise.  Suppose,  for  instance,  that  in  the  year  1776  (t  —  1776) 
consumption  fell  short  of  its  theoretical  level  a  +  yZme  by  2.24 
million  dollars,  that  is,  that  Ume  =  —2.24.  Obviously,  then,  the 
average  value  of  Wme  is  exactly  —2.24.  What  could  we  possibly  wish 
to  convey  by  the  statement  that  Wine  (and  every  other  ut)  has  zero 
expected  value? 

One  should  never  identify  the  concept  of  expected  value  with  the 
concept  of  the  arithmetic  mean.  Arithmetic  mean  denotes  the  sum  of 
a  set  of  N  numbers  divided  by  N  and  is  an  algebraic  concept.     Expected 
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value  is  a  statistical  or  combinatorial  concept.  You  have  to  imagine  an 
urn  containing  a  (finite  or  infinite)  number  of  balls,  each  with  a  number 
written  on  it.  Consider  now  all  the  possible  ways  one  could  draw  one 
ball  from  such  an  urn.  The  arithmetic  mean  of  the  numbers  that 
would  turn  up  if  we  exhausted  all  possible  ways  of  drawing  one  ball  is 
the  expected  value. 

The  random  term  of  an  econometric  model  is  assumed  to  come  from 
an  Urn  of  Nature  which,  at  every  moment  of  time,  contains  bails  with 
numbers  that  add  up  to  zero. 

The  common-sense  interpretation  of  Nature's  Urn  is  as  follows: 
Though  in  1776  actual  consumption  in  fact  fell  short  of  the  theoretical 
by  2.24  and  no  other  amount,  the  many  causes  that  interacted  to 
produce  Wme  =  —2.24  could  have  interacted  (in  1776)  in  various  other 
ways.  This,  theoretically,  they  were  free  to  do  since  they  were 
random  causes.  Now,  try  to  think  of  all  conceivable  combinations  of 
these  causes — or  if  you  prefer,  think  of  very  many  1776  worlds, 
identical  in  all  respects  except  in  the  combinations  of  random  causes 
that  generated  the  random  term.  Let  us  have  as  many  such  worlds 
as  there  are  theoretical  combinations  of  the  causes  behind  the  random 
term.  In  some  worlds  the  causes  act  overwhelmingly  to  make  con- 
sumption lower  than  the  nonstochastic  level  a  -f-  T^me J  in  other 
worlds  the  causes  act  so  as  to  make  it  greater  than  a  +  7^n?eJ  and  in  a 
few  worlds  the  causes  cancel  out,  so  that  Cm*  =  «  +  T^me  exactly. 
Now  consider  the  random  terms  of  all  possible  worlds,  and  (says  the 
assumption)  they  will  average  out  to  zero. 

This  interpretation  is  a  conceptual  model  we  can  never  hope  to 
prove  or  disprove.  Its  chief  merit  is  that  it  reduces  chance  and 
statistics  to  the  (relatively)  easy  language  and  theorems  of  com- 
binatorial algebra.  Some  people  take  it  seriously;  others  (myself 
included)  use  it  for  lack  of  anything  better. 

Digression  on  the  distinction  between  "population" 


Whether  or  not  we  take  Nature's  Urn  seriously,  we  will  be  well 
advised  to  acknowledge  that  we  are  dealing  with  three  levels  of 
discourse,  not  just  the  two  that  I  called  population  and  sample. 
The  third  and  deeper  level  is  called  the  universe.     It  contains  all 
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events  as  they  have  happened  and  as  they  might  have  happened  if 
everything  else  had  remained  the  same  but  the  random  shocks. 

Level  I    Sample :  things  that  both  happened  and^were  observed. 
It  is  drawn  from 
Level  II    Population :  things  that  happened  but  were  not  neces- 
sarily observed.     It  is  drawn  from 
Level  III    Universe:  all  things  that  could  have  happened.     (In 
the  nature  of  things  only  a  few  did.) 

CONVENTION 

We  shall  henceforth  use  four  types  of  running  subscript: 


8  -  1,  2,   .   . 

•  ,8 

for  the  sample 

p  =  1,  2,  .  . 

■  ,P 

for  the  population 

i=l,2,.. 

■  ,1 

for  the  universe 

t  m  1,  2,   .   . 

•  ,T 

for  instances  in  general,  whether 
they  come  from  the  sample,  the 
population,  or  the  universe 

In  a  sense  the  population  (CP,ZP)  of  consumption  and  income 
as  they  actually  happened  in  recorded  and  unrecorded  history  is 
merely  a  sample  from  the  universe  (Ci}Zt)  of  all  possible  worlds. 
Naturally,  what  we  call  the  sample  is  drawn  from  the  population 
of  actual  events,  not  from  the  hypothetical  universe  of  level  III. 
In  most  instances  it  does  no  harm  to  speak  (and  prove  theorems) 
as  if  level  I  were  picked  directly  from  level  III,  not  from  level  II. 

The  Platonic  universe  of  level  III  is  indeed  rather  unseemly  for 
the  field  of  statistics  (which  is  surely,  in  lay  opinion,  the  most 
hard-boiled  of  mathematics,  resting  solidly  on  " facts")  and  has 
been  amusingly  ridiculed  by  Hogben.1 

The  next  few  paragraphs  state  why  the  abstract  model  of 
Nature's  Urn  is  a  less  appropriate  foundation  for  econometrics 
than  for  statistical  physics  or  biology.  But  the  rest  of  the  book 
goes  merrily  on  using  the  said  Urn. 

Economic  and  physical  phenomena  alike  take  place  in  time. 

1  Lancelot  Hogben,  F.  R.  S.,  Statistical  Theory:  The  Relationship  of  Probability, 
Credibility,  and  Error;  An  Examination  of  the  Contemporary  Crisis  in  Statistical 
Theory  from  a  Behaviourist  Viewpoint,  pp.  98-105  (London:  George  Allen  &  Unwin 
Ltd.,  1957). 
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In  both  fields,  the  statement  that  ut  is  a  random  variable  for  each 
t  is  inevitably  an  abstraction,  because  time  runs  on  inexorably. 
In  the  physical  sciences  events  are  deemed  "repeatable,"  or  aris- 
ing from  a  universe  "fixed  in  repeated  samples,"  primarily 
because  the  experimenter  can  ideally  replicate  exactly  all  system- 
atically significant  conditions  that  had  surrounded  his  original 
event.  This  is  not  possible  in  social  phenomena  of  the  "  irre- 
versible" or  "progressive"  type.  Although  in  the  physical 
sciences  it  may  be  safe  to  neglect  the  difference  between  popula- 
tion and  universe,  it  is  unsafe  in  econometrics.  For,  as  economic 
phenomena  take  place  in  time,  all  other  conditions,  including 
the  exogenous  variables,  move  on  to  new  levels,  often  never  to 
return.  The  common-sense  phrase  " on  the  average  over  similar 
experiments"  makes  much  more  sense  in  a  laboratory  science 
than  in  economics. 

Nature's  Urn  also  supports  maximum  likelihood,  variance  of  an 
estimate,  bias,  consistency,  and  many  other  notions  we  shall  have 
occasion  to  introduce  in  later  chapters.  All  these  rest  on  the 
notion  of  "all  conceivable  samples."  The  class  of  all  conceivable 
samples  includes  first  of  all  samples  of  all  conceivable  sizes;  it  also 
includes  all  conceivable  samples  of  a  given  size^  say,  4.  A  sample 
of  size  4  may  consist  of  points  that  actually  happened  (if  so,  they 
are  in  the  population) ;  it  also  could  consist  (partly  or  entirely) 
of  points  in  the  universe  but  not  in  the  population.  The  latter 
kind  of  sample  is  easy  to  conceive  but  impossible  to  draw,  because 
the  imagined  points  never  "happened."  Therefore,  even  a  com- 
plete census  of  what  happened  is  not  enough  for  constructing  an 
exhaustive  list  of  all  conceivable  samples. 

Assumption  3.     The  variance  of  ut  is  constant  over  time. 

This  means  merely  that,  in  each  year,  ut  is  drawn  from  an  identical 
Urn,  or  universe.  This  assumption  states  that  the  causes  underlying 
the  random  term  remain  unchanged  in  number,  relative  importance, 
and  absolute  impact,  although,  in  any  particular  year,  one  or  another 
of  them  may  fail  to  operate. 

For  simplicity's  sake  we  have  assumed  no  errors  of  measurement. 
In  fact  there  may  be  some,  and  their  typical  size  could  vary  system- 
atically with  time  (or  with  the  independent  variable  Z).     If  we  try 
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to  measure  the  diameter  of  a  distant  star,  our  error  of  measurement  is 
likely  to  be  several  million  miles;  when  we  measure  the  diameter  of 
Sputnik,  it  can  be  only  a  few  feet.  Likewise,  if  our  data  stretch  from 
3850  to  1050,  national  income  increases  by  a  factor  of  20.  It  is  quite 
likely  that  errors  of  measurement,  too,  should  increase  absolutely. 
If  they  do,  Assumption  3  is  violated,  and  some  of  the  techniques  that 
I  develop  below  should  not  be  used. 

Digression  on  infinite  variances 

The  variance  of  u  is  not  only  constant  but  finite.  When  u  is 
normal  it  is  unnecessary  to  stipulate  that  its  variance  is  finite, 
because  all  nondegenerate  normal  distributions  have  finite  vari- 
ance. There  exist,  however,  nondegenerate  distributions  with 
zero  mean  and  infinite  variance,  for  example,  the  discrete  random 
variable 

.  .  .  ,     -16,     -8,     -4,      4,      8,       16, 

with  probabilities,  respectively, 

U»        \/«        U      V      l 


/ 


16 


H,    K,    H,    He, 


The  central  limit  theorem,  according  to  which  the  sum  of  N 
random  variables  (distributed  in  any  form)  approaches  the 
normal  distribution  for  large  Nt  is  valid  only  if  the  original  distri- 
butions have  finite  variances. 

Assumption  4.     The  error  term  is  normally  distributed. 

This  is  a  rather  strong  restriction.     We  impose  it  mainly  because 
normal  distributions  are  easy  to  work  with. 

Digression  on  the  univariate  normal  distribution 

The  single- variable  normal  distribution  is  shaped  like  a  sym- 
metrical two-dimensional  bell  whose  mouth  is  wider  than  the 
mouth  of  anything  you  might  name.  Normal  distributions  come 
tall,  medium,  and  squat  (i.e.,  with  small,  medium,  or  large  vari- 
ances).   And  the  top  of  the  bell  can  be  over  any  point  of  the 
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w  axis;  that  is  to  say,  the  mean  of  the  normal  can  be  negative, 
zero,  or  positive,  large  or  small.  This  distribution's  chief  charac- 
teristic is  that  extreme  values  are  more  and  more  unlikely  the 
more  extreme  they  get. 

For  instance,  the  likelihood  that  all  the  people  christened 
John  and  living  in  London  will  die  today  is  extremely  small,  and 
the  likelihood  that  none  of  them  will  die  today  is  equally  small. 
Now  why  is  this?  Because  London  is  not  under  atomic  attack, 
the  Johns  are  not  all  aboard  a  single  bus,  not  all  of  them  are  diving 
from  the  London  Bridge,  nor  were  they  all  born  85  years  ago. 
Each  goes  about  his  business  more  or  less  independently  of  the 
others  (except,  perhaps,  father-and-son  teams  of  Johns),  some  old, 
some  young,  some  exposing  themselves  to  danger  and  others  not. 
The  reason  why  the  probability  that  w  of  these  Johns  will  die 
today  approximates  the  normal  is  that  there  are  very  many  of 
them  and  that  each  is  subjected  to  a  vast  number  of  independent 
influences,  like  age,  food,  heredity,  job,  and  so  forth.  This 
probability  would  not  be  normal  if  the  Johns  were  really  few, 
if  the  causes  working  toward  their  deaths  were  few,  or  if  such 
causes  were  many  but  linked  with  one  another. 

The  assumption  that  u  is  normal  is  justified  if  we  can  show  that  the 
variables  left  out  of  equation  (1-2)  are  infinitely  numerous  and  not 
interlinked.  If  they  are  merely  very  many  and  not  interlinked,  then 
u  is  approximately  normal.  If  they  are  infinitely  many  but  enough 
of  them  are  interlinked,  then  u  is  not  even  approximately  normal. 
We  often  know  or  suspect  that  these  variables,  such  as  wealth,  liquid 
balances,  age,  residence,  and  so  forth,  are  quite  interlinked  and  are 
very  likely  to  be  present  together  or  absent  together. 

Sometimes  the  following  argument  is  advanced:  In  our  model  of 
consumption,  the  error  term  u  stems  from  many  sources;  that  is,  we 
have  left  out  variables,  we  have  left  out  equations,  we  have  linearized, 
we  have  aggregated,  and  so  on.  These  are  all  different  operations, 
presumably  not  linked  with  one  another.  Therefore,  u  is  normally 
distributed. 

This  argument  is,  of  course,  a  bad  heuristic  argument,  and  it  does 
not  even  stand  for  an  existing  (but  difficult)  rigorous  argument.    It 
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Is  logically  untidy  to  count  as  arguments  for  the  normality  of  u,  on  one 
and  the  same  level  of  discourse,  such  diverse  items  as  the  fact  of 
linearization  and  the  number  of  unspecified  variables  that  affect 
consumption. 

The  assumption  stands  or  falls  on  the  argument  of  many  non- 
interlinked  absent  variables.  Most  alternative  assumptions  cause 
great  computational  grief. 

Assumption  5.     The  random  terms  of  different  time  periods 
are  independent. 

This  assumption  requires  that  in  each  period  the  causes  that  deter- 
mine the  random  term  act  independently  of  their  behavior  in  ail 
previous  and  subsequent  periods.     It  is  easy  to  violate  this  assumption. 

1.  The  error  term  includes  variables  that  act  cyclically.  If,  for 
example,  we  think  consumption  has  a  bulge  every  3  years  because  that 
is  how  often  we  get  substantially  remodeled  cars,  this  effect  should  be 
introduced  as  a  separate  variable  and  not  included  in  u. 

2.  The  model  is  subject  to  cobweb  phenomena.  Suppose  that 
consumers  in  year  1  (for  any  reason)  underestimate  their  income,  so 
that  they  consume  less  than  the  theoretical  amount.  Then  in  year  2 
they  discover  the  error  and  make  it  up  by  consuming  more  than  the 
theoretical  amount  of  year  2;  and  so  on. 

3.  One  of  the  causes  behind  the  random  term  may  be  an  employee's 
vacation,  which  is  usually  in  force  for  2  weeks  though  the  model's 
unit  period  is  1  week.  Any  such  behavior  violates  the  requirement 
that  the  error  of  any  period  be  independent  of  the  error  in  all  previous 
periods, 

Assumption  0.     The  error  is  not  correlated  with  any 
predetermined  variable. 

To  appreciate  this  assumption,  suppose  that  (for  whatever  reason) 
sellers  set  today's  price  pt  on  the  basis  of  the  change  in  the  quantity 
sold  yesterday  over  the  day  before;  that  is, 

Vt  =  «  +  7te-t  -  qt-z)  +  ut 
Suppose,  further,  that  the  greater  (and  more  evident)  the  change  in  q 
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the  more  they  strive  to  set  a  price  according  to  the  above  rule.    Such 
behavior  violates  Assumption  6. 

We  can  think  of  examples  where  behavior  is  fairly  exact  (small  u's) 
for  moderate  values  of  the  independent  variable,  but  quite  erratic  for 
very  large  or  very  small  values  of  the  independent  variable.  I  am 
apt,  for  instance,  to  stumble  more  if  there  is  either  too  little  or  too 
much  light  in  the  room.  This,  again,  violates  Assumption  6,  because 
the  error  in  the  stochastic  equation  describing  my  motion  depends  on 
the  intensity  of  light. 

So  we  come  to  the  end  of  our  statistical  assumptions  about  the  error 
u.  When  in  future  discussion  I  speak  of  u  as  having  "all  the  Simplify- 
ing Properties"  (or  as  "satisfying  all  the  Simplifying  Assumptions"), 
I  mean  exactly  these  six. 

Certain  of  these  six  assumptions  can  bo  checked  or  statistically 
verified  from  a  sample;  others  cannot.  I  shall  return  to  this  topic 
later. 

Of  these  assumptions  only  Assumption  1  is  obligatory.  There  are 
decent  estimating  procedures  for  other  sets  of  assumptions. 

1.10.  Mathematical  restatement  of  the  Six  Simplifying 
Assumptions 

1.  ut  is  random  for  every  t:  Some  p(u)  is  defined  for  all  %  such 
that 

0  <  p  <  1        and        J   p(u)  du  =  1 

2.  The  expected  value  of  ut  is  zero: 

tut  =  0        for  all  t 

3.  The  variance  (r„«(0  is  constant  in  time,  and  finite: 

0  <  <TUu{t)  =  cov  (uhut)  =  <Tuu  <  °°         for  all  t 

4.  ut  is  normal: 

p(u)  -  (2t)->*  det  (*«u)->*  exp  [-^(w  -  ew)((r„«)-I(w  -  8w)] 
I  explain  this  fancy  notation  in  the  next  chapter.    I  use  it  because  it 
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generalizes  very  handily  into  many  dimensions.    The  usual  way  to 
write  the  normal  distribution  is 

The  symbol  <r„u  is  the  square  of  cr;  <rtttt  is  the  variance  of  u. 

5.  u  is  not  auto  correlated: 

Z(ut>ut-e)  =  0        for  all  t  and  for  6  j£  0 

6.  u  is  fully  independent  of  the  variable  Z\ 

cov  (ut,Zt-e)  =  0        for  all  t  and  all  0 


1.11.  Interpretation  of  additivity 

The  random  term  u  appears  in  model  (1-2)  as  an  additive  term. 
This  fact  rules  out  interaction  effects  between  u  and  Z.  Absence  of 
interaction  effects  means  that,  no  matter  what  the  level  of  income  Z 
may  be,  a  random  term  of  a  given  magnitude  always  has  the  same 
effect  on  consumption.  Its  impact  does  not  depend  on  the  level  of 
income. 

1.12.  Recapitulation 

We  must  be  very  clear  in  econometrics,  as  well  as  in  other  areas  of 
statistical  inference,  about  what  is  assumed,  what  is  observed,  and 
what  is  guessed,  and  also  about  what  criterion  the  guess  should  satisfy. 
Table  1.1  provides  a  check  list  of  things  we  accept  by  assumption, 
things  we  can  and  cannot  do,  and  things  we  must  do  in  making 
statistical  estimates.  The  items  in  the  first  three  columns  have  been 
introduced  in  this  chapter;  the  estimating  criteria  in  the  fourth  column 
will  be  discussed  in  the  next  chapter. 

Digression  on  the  differences  among  moment, 
expectation,  and  covariance 

Consider  two  variables,  consumption  c  and  yesterday's  income 
2.     They  may  or  may  not  be  functionally  related.     They  have  a 
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Table  1.1 


Thebb  things  are 

ASSUMED 

These  things  are 
observed 

These  things 

ARE  NOT 
OBSERVED 

This  is 
imposed 

These  things  aru 
computed  by  us 

That  a  true  a  and  a 
true  7  exist 

The  true  a 
and  7 

Some  estimat- 
ing criterion 
for  comput- 
ing a,  7,  S 

a,  a  guess  as  to  « 
7,  a  guess  as  to  7 

That  a  true  «<  exists 
in  each  time  period 

The  true  u< 

The  residuals  SU 
(4  -  1,2,  ,  ,  ,  ,B) 

That  ut  has  the 
Six 

Simplifying 
Properties 

«u,  the  ex- 
pected value 

ff«u,  the  vari- 
ance of  the 
error 

(%%,)/$,  the  mean 
of  the  rosiduals  5* 

tnzz,  the  moment  of 
the  residuals  h. 

That  there  is  a 
universe  d,  Zi 

(t  -  1,2 /) 

in  which 

Ci  "  a  +  yZi  +  W 

The  C's  and  Z'» 

of  the  sample 
denoted  by  C,  Z, 
(«  -  1,2 S) 

The  C's  and 
Z'b  not  in 
the  sample 

universe 


U  -  f Cl' 


•  •  > 

•     •     9 


Expectation 

The  average  value  of  c  in  the  universe  is  symbolized  by  Zc  (read 
"expected  value  of  c"  or  "expectation  of  o").  Similarly,  e«  is 
the  average  z  m  J/te  universe, 

Covariance 
The  covariance  of  c  and  z  is  defined  as  the  expected  value 

Efe  -  ic)(*  -  es) 

where  i  runs  over  J/te  entire  universe.    This  is  symbolized  by 
cov  (c,0)  or  <rCf. 

Variance 

The  variance  of  c  is  simply  the. covariance  of  c  and  c.  It  is 
written  var  c,  or  cov  (c,c),  or  (rcc. 
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Now  consider  a  specific  sample  S°  made  up  of  specific  (corre- 
sponding) pairs  of  consumption  and  income,  for  instance, 


[C27,       C64,       ClOS  I 
Z27,     254,     Z105J 


Let  the  sample  means  for  this  particular  sample  be  written  c° 
and  z°,  respectively. 

Moment 

The  moment  (for  sample  S°)  of  c  on  t  is  defined  as  the  expected 
value 

E($  -  c')(z9  -  2?) 

where  s  runs  over  27,  54,  and  105  only.  It  is  symbolized  by 
mc.e(S°)  or  mc.f  or  simply  mct.  Of  course,  a  different  sample  S1 
would  give  a  different  moment  mc.t(Sl). 

Expectation  of  a  moment 

Now  consider  all  samples  of  size  3  that  we  can  draw  (with 
replacement)  from  the  universe  U.  Then  the  expectation  of  mC9 
is  the  average  of  the  various  moments  me.M(S°),  mc.,(S1),  etc., 
when  all  conceivable  samples  of  size  3  are  taken  into  account. 

A  universe  with  J  elements  generates  f    )  such  samples,  and  the 

means  c  and  z  of  the  two  variables  vary  from  sample  to  sample. 

The  expectation  of  mct  for  samples  of  size  4  is,  in  general,  a 
different  value  altogether. 

Much  confusion  will  be  avoided  later  on  if  these  distinctions 
are  kept  in  mind.  Clear  up  any  questions  by  doing  the  exercises 
below. 

Exercises 

l.A  If  c,  c',  c",  c*  are  four  independent  drawings  (with  replace- 
ment) from  the  universe  U,^prove~  that  e(c'  -f  c"  +  c*)  —  3  ec. 

l.B  If  c,  z}  q  are  variables  and  A;  is  a  constant,  which  of  the  follow- 
ing relations  are  identities? 
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cov  (c,z)  =  cov  («,c)  •  J. 

'met  =  m»e  • 

e(c  -f  z)  =  sc  -f  ez  •    v 

cov  (A;c,^)  =  A;  cov  (c,z)  • 

var  (ft?)  =  k2  var  5  ,  x 

m(*e).a  =  fcmM   • 

cov  (c  +  $,«)  =  cov  (c,  z)  4-  cov  (q,z)  / 

Further  readings 

Tlic  art  of  model  specification  is  learned  by  practice  and  by  studying 
cleverly  contrived  models.  Beach,1  Klein,  Tinberoen,  and  Tintnee  give 
several  examples.  L.  R.  Klein  and  A.  S.  Goldberger,  An  Econometric  Model 
of  the  United  States:  1929-1952  (Amsterdam:  North-Holland  Publishing 
Company,  1957),  present  a  celebrated  large-scale  econometric  model.  Chap- 
ters 1  and  2  give  a  good  idea  of  the  difficulties  of  estimation.  The  performance 
of  this  model  is  appraised  by  Karl  A.  Fox  in  "Econometric  Models  of  tha 
United  States"  {Journal  of  Political  Economy ;  vol.  44,  no.  2,  pps  128-143, 
April,  195G)  and  by  Carl  Christ  in  "Aggregate  Econometric  Models"  (American 
Economic  Review,  vol.  46,  no.  3,  pp.  385-408,  June,  1956). 

"The  Dynamics  of  the  Onion  Market,"  by  Daniel  B.  Suit!  and  Susumu 
Koizumi  (Journal  of  Farm  Economics,  vol.  38,  no.  2,  pp.  475-484*  May,  1956)^ 
is  an  interesting  example  of  econometrics  applied  to  a  particular  market  In 
the  short  run. 

Kendall,  chap.  7,  reviews  the  logic  of  probability,  sampling,  and  expected 
value.  For  a  lucid  discussion  of  the  concept  of  randomness,  iej  M«  G. 
Kendall,  "A  Theory  of  Randomness"  (Biomelrika,  vol.  32,  pt.  1,  pp.  1-15, 
January,  1941). 

As  far  as  I  know,  the  assumptions  about  the  random  terra  have  not  been 
discussed  systematically  from  the  economic  point  of  view,  except  for  Mar- 
schak's  brief  passage  (pp.  12-15)  in  Hood,  chap.  1,  sec.  7.  See  also  Gerhard 
Tintncr,  The  Variate  Difference  Method  (Cowles  Commission  Monograph  5, 
pp.  4-5  and  appendixes  VI,  VII,  Bloomington,  Indiana:  Prineipia  Press, 
1940),  and  Tintner,  "A  Note  on  Economic  Aspects  of  the  Theory  of  Errors 
in  Time  Series"  {Quarterly  Journal  of  Economics,  vol.  53,  no.  1,  pp.  141-149, 
November,  1938). 

As  defined  in  economics  textbooks,  the  production  function  and  the  cost 
function  necessarily  violate  Assumption  2.  In  no  instance  (whether  in  the 
universe,  the  population,  or  the  sample)  can  the  random  disturbance  exceed 
zero  in  the  production  function  and  fall  short  of  zero  in  the  average  cost 

1  See  Frequent  References  at  front  of  book.  Works  of  authors  whose  names 
are  capitalized  are  listed  there. 
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function.  All  statistical  studies  of  production  and  cost  functions  I  know  of 
have  implicitly  used  the  assumption  that  zu  «■  0.  The  error  is  in  the  assump- 
tion of  normality.  Sco,  for  instance,  Joel  Dean,  "Department  Store  Cost 
Functions,"  in  Studies  in  Mathematical  Economics  and  Econometrics,  in  memory 
of  Henry  Schultz,  edited  by  Oscar  Langc,  Francis  Mclntyre,  and  Theodore  0. 
Yntema  (p.  222,  Chicago:  University  of  Chicago  Press,  1942),  which  is  also  an 
interesting  attempt  to  fit  static  cost  functions  to  data  from  years  of  large 
dynamic  changes.  In  this  respect  I  was  guilty  myself  in  "An  Econometric 
Model  of  Growth:  U.S.A.  1869-1953"  {American  Economic  Review,  vol.  45, 
no.  2,  pp.  208-221,  May,  1955). 

For  examples  of  nonadditive  disturbances,  see  Hurwicz,  "Systems  with 
Nonadditivo  Disturbances,"  chap.  18  of  Koopmans,  pp.  410-418. 


CHAPTER  2 


Estimating  criteria  and  the  method 
of  least  squares 


2.1.  Outline  of  the  chapter 

This  chapter,  like  the  previous  one,  deals  exclusively  with  single- 
equation  models.  Unless  the  contrary  is  stated,  all  the  Simplifying 
Assumptions  of  Sec.  1.9  remain  in  force.  The  main  points  of  this 
chapter  are  the  following: 

1.  Once  we  have  specified  the  model  and  made  certain  stochastic 
assumptions,  our  sample  tells  us  nothing  about  the  unknown  parame- 
ters of  the  model  unless  we  adopt  an  estimating  criterion. 

2.  A  very  reasonable  (and  hard  to  replace)  criterion  is  maximum 
likelihood.  It  is  based  on  the  assumption  that,  while  we  were  taking 
the  sample,  Nature  performed  for  our  benefit  the  most  likely  thing, 
or  generated  for  us  her  most  probable  sample. 

3.  Once  the  maximum  likelihood  criterion  is  adopted,  we  can  tell 
precisely  what  the  unknown  parameters  must  be  if  our  sample  was 
the  most  likely  to  turn  up.  This  is  what  is  called  maximizing  the 
likelihood  function.  We  find  the  unknowns  by  manipulating  this 
function. 
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4.  The  familiar  least  squares  fit  arises  as  a  special  case  of  the  oper- 
ation of  maximizing  the  likelihood  function. 

5.  In  many  cases,  adopting  the  maximum  likelihood  criterion  auto- 
matically generates  estimates  that  conform  to  other  estimating 
criteria,  for  example  unbiasednessy  consistency,  efficiency. 

6.  If  estimates  of  the  unknown  parameters  are  unbiased,  consistent, 
etc.,  this  does  not  mean  that  our  particular  sample  or  method  has 
given  us  a  correct  estimate.  It  means  that,  if  we  had  infinite  facilities 
(or  infinite  patience),  we  could  get  a  correct  estimate  "in  the  long  run" 
or  "on  the  average." 

7.  The  likelihood  function  not  only  tells  us  what  values  of  the 
parameters  give  the  greatest  probability  to  the  observed  event  but 
also  attaches  to  such  values  degrees  of  credence,  or  reliability. 

Though  tinse  statements  can  be  made  about  all  sorts  of  models, 
the  single-equation  model  of  consumption  that  I  have  been  using  all 
along  captures  the  spirit  of  the  procedure.  Multi-equation  models 
have  all  the  complications  of  single-equation  models  plus  many  others. 

2.2.  Probability  and  likelihood 

In  common  speech,  probability  and  likelihood  are  but  Latin  and 
Saxon  doublets.  In  statistics  the  two  terms,  though  often  inter- 
changed for  the  sake  of  variety  or  style,  have  distinct  meanings. 
Probability  is  a  property  of  the  sample;  likelihood  is  a  property  of  the 
unknown  parameter  values. 

Probability 

Imagine  that,  in  a  model  that  described  Nature's  workings  perfectly, 
the  true  values  of  the  parameters  a,  /?,  7,  .  .  .  were  such  and  such 
and  that  the  true  stochastic  properties  of  the  error  term  u  were  such 
and  such.  We  would  then  say  that  certain  types  of  natural  behavior 
(i.e.,  certain  samples  or  observations)  were  more  probable  than  others. 
For  example,  if  you  knew  that  a  river  flowed  gently  southward  at  a 
speed  of  3  miles  per  hour,  that  an  engineiess  boat  drifting  on  it  had 
such  and  such  dimensions,  weight,  and  friction  (the  model);  if,  in 
addition,  you  knew  that  gentle  breezes  usually  blow -in  the  area,  very 
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rarely  faster  than  5  miles  per  hour,  and  that  they  usually  blow  now 
in  one,  now  in  another  direction  (the  stochastic  properties) ;  then  you 
would  be  very  much  surprised  to  find  an  instance  in  which  the  boat 
had  traveled  25  miles  northward  or  30  miles  southward  in  the  space 
of  2  hours  (the  improbable  behavior). 

Likelihood 

Now  reverse  the  position.  If  you  were  sure  of  your  information 
about  the  wind,  if  you  did  not  know  which  way  or  how  fast  the  river 
flowed,  but  you  observed  the  boat  28  miles  south  of  where  it  was 
2  hours  ago  and  were  willing  to  assume  that  Nature  took  the  most 
probable  action  while  you  happened  to  be  observing  her,  then  you 
would  infer  that  the  river  must  have  a  southward  current  of  14  miles 
per  hour.  This  is  the  maximum  likelihood  estimate,  or  most  likely 
(NOT  most  probable)  speed  of  the  river  on  the  evidence  of  this 
unfortunate  sample.  Any  other  southward  speed  and  any  kind  of 
northward  flow  are  highly  unlikely,  or  less  likely  than  14  miles  per 
hour  south. 

To  say  that  any  other  speed  is  less  probable  is  to  misuse  the  term. 
The  river's  speed  is  what  it  is  (3  miles  per  hour  to  the  south)  and  it 
cannot  be  more  or  less  probable.  What  can  be  more  or  less  probable 
is  the  particular  observation:  that  the  boat  has  traveled  southward 
28  miles.  This  observation  is  very  improbable  if  the  river  indeed 
flows  3  miles  per  hour  southward.  It  would  be  more  probable  if  the 
river  flowed  southward  with  a  speed  of  5,  7,  or  10  miles.  And  it 
would  be  most  probable  if  the  true  speed  of  the  river  had  been  14  miles 
per  hour.  Evidently,  a  maximum  likelihood  guess  can  be  very  far 
from  the  truth. 

All  estimation  in  econometrics  operates  as  in  this  river  example,  no 
matter  how  elaborate  the  model,  sloppy  or  exquisite  the  sample. 

What  is  so  commendable  about  the  maximum  likelihood  criterion, 
if  it  cannot  guarantee  us  correct  or  even  nearly  correct  results?  Why 
assume  that  Nature  will  do  the  most  likely  thing?  All  I  can  say  to 
this  is  to  ask:  Well,  what  shall  we  assume  instead?  the  second  most 
likely  thing?  the  seventy-first? 

It  is  true  that  (in  some  cases)  maximum  likelihood  estimates  tend 
to  be  correct  estimates  "on  the  average"  or  "in  the  long  run"  (see 
Sees.  2.1  and  2.10)".     These  facts,  however,  are  irrelevant,  because  we 
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use  the  maximum  likelihood  criterion  even  when  we  plan  neither  to 
repeat  the  experiment  nor  to  enlarge  the  sample. 

It  is  very  important  to  appreciate  just  what  maximum  likelihood 
estimation  docs:  The  experimenter  makes  one  observation,  say,  that 
the  boat  had  traveled  28  miles  southward  in  2  hours;  he  then  asserts 
hopefully  (he  does  not  know  this)  that  the  wind  has  been  calm,  because 
this  is  the  most  typical  total  net  wind  speed  for  all  conceivable  2-hour 
stretches;  and  so  he  lets  his  estimate  of  the  speed  be  14  miles. 

Actually,  we  (who  happen  to  know  that  the  true  speed  is  3  miles) 
realize  that,  while  the  experimenter  was  busy  measuring,  the  weather 
was  not  at  all  typical  but  happened  to  be  the  improbable  case  of 
2  hours  of  strong  southerly  wind. 

The  same  experimenter  under  different  circumstances  might  esti- 
mate the  speed  to  be  G,  0.5,  —2,  —3.0,  etc.,  miles  per  hour  depending 
on  the  wind's  actual  whim  during  the  2-hour  interval  in  which  observa- 
tion took  place. 

Exercise 

2.A  Set  up  an  econometric  model  of  the  river-and-boat  example 
of  Sec.  2.2,  using  the  following  symbols:  dt  for  the  number  of  miles 
(from  a  fixed  point)  traveled  southward  in  t  hours  by  the  boat,  y  for  the 
(unknown)  speed  of  the  river  in  miles  per  hour,  and  ut  for  the  net 
southbound  component  of  the  wind's  speed  in  miles  per  hour.  Let  ut 
have  the  following  stochastic  specification : 

10  per  cent  of  the  time  ut  —  11  (southbound) 

70  per  cent  of  the  time  ut  =  0  (calm) 

10  per  cent  of  the  time  ut  =  —  5  (northbound) 

10  per  cent  of  the  time  u%  =  —  6  (northbound) 

Construct  a  probability  table  giving  the  net  wind  effects  for  2  hours  in 
succession.  For  each  typr<  of  conceivable  observation,  derive  the 
maximum  likelihood  estimate  of  7. 


Digression  on  the  multivariate  normal  distribution 

The  univariate  normal  distribution  for  a  variable  u  with 
universe  mean  eu  and  variance  «ruu  was  written  in  Sec.  1.10  in  the 
fancy  form 
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p(u)  =  (2ir)-w  det  (*„„)-»  exp  [-K(w  -  ctiXO"1^  -  *01 

(2-1) 

because  of  the  ease  with  which  it  generalizes  to  the  multivariate 
case. 

Let  u\y  Ui,  .  .  .  ,  un  be  N  variables  which  have  a  joint  normal 
distribution.    Define 

U   =   VeC  (UlyUiy    .    .    .    }un) 

eu  «  vec  (ewi,£W2,  .  .  .  ,zun) 


<ruu  — 


«i«i 


J^UNUi 


For  <r„lttJ  we  often  write  cov  (wi,W2),  or  simply  cn  if  the  meaning 
is  clear  from  the  context.  Sometimes  the  inverse  of  <ratt,  usually 
written  (O"1*  is  written  <ruu,  and  its  elements  are  written  a***' 
or  just  o^.  These  superscripts  are  not  exponents.  If  we  need 
to  write  an  exponent  we  write  it  outside  parentheses,  as  in  equa- 
tion (2-1). 

To  get  the  multivariate  distribution  for  u\f  Utf  .  .  .  ,  us,  all 
we  need  to  do  is  change  the  italic  uf8  of  (2-1)  into  bold  characters: 

p(u)  =  (2tt)-"/2  det  (cruu)->*  exp  [-*4(u  -  eu)(cruu)-»(u  -  Su)] 

(2-2) 

This  illustrates  the  principle  noted  in  Sec.  1.2:  that  if  an  oper- 
ation, theorem,  property,  etc.,  holds  for  simple  numbers,  it  holds 
analogously  for  matrices.  This  is  a  great  convenience,  because 
you  can  pretend  that  matrices  are  numbers  and  so  collapse  a 
complicated  formula  into  a  shorter  and  more  intuitive  expression. 
Moreover,  by  pretending  a  matrix  is  a  number,  you  can  get  a 
clear  impression  of  what  a  formula  conveys. 

Exercises 

2.B    Write  explicitly  the  joint  normal  distribution  of  the  two 
variables  x  and  w. 
2.C    In  Exercise  2.B,  modify  the  formula  for  <rxx  =  aw  and  o-,,*  »  0. 


28  ESTIMATING   CRITERIA  AND  THE  METHOD  OP  LEAST  SQUARES 

2.D    Write  in  vector  and  matrix  notation  the  formula 

N       N 

—  H     2,       2     (Wm  *""   eW^nn(Pn—   tWn) 
tn-  1  n  —  1 


2.3.  The  concept  of  likelihood  function 

Consider  again  the  river-and-boat  illustration  of  the  previous  sec- 
tion.   Our  information  can  come  in  one  of  several  different  ways. 

1,  A  sample  of  one  observation 

Someone  may  have  sighted  the  boat  at  the  zero  point  at  twelve 
o'clock  and  28  miles  south  of  that  2  hours  later.  This  is  one  observa- 
tion; it  leads  to  i  —  2%  =  14  miles  per  hour  southward  as  the  maxi- 
mum likelihood  estimate  of  the  river's  speed.  The  number  of  hours 
elapsed  from  the  beginning  to  the  end  of  the  observation  could  have 
been  1  or  J^  or  7  or  anything  else. 

2.  A  sample  of  several  independent  observations 

We  may  have  several  observations  like  the  above  but  made  on 
different  days.     For  instance, 

Observation       Time  elapsed       Distance  traveled 


i 

2  hours 

28  miles  south 

2 

4  hours 

12  miles  south 

3 

17  hours 

44  miles  south 

3.  Several  interdependent  observations 

Observations  may  overlap ;  as,  for  example, 

Observation       Time  of  observation       Distance  traveled 

a  12  to  2  p.m.  28  miles  south 

b  1  to  5  p.m.  20  miles  south 

of  the  same  day 

Or  information  may  come  in  even  more  complicated  ways.  The  likeli- 
hood function  can  be  constructed  only  if  we  know  both  the  circum- 
stances of  our  observations  and  the  readings  derived  from  them. 
Cases  1,  2,  and  3  lead  to  different  likelihood  functions  because  the 
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circumstances  differ.  Two  observers,  each  of  whom  watched  the  boat 
for  2  consecutive  hours  unbeknownst  to  the  other,  would  set  up  two 
likelihood  functions  identical  in  form  into  which  they  would  feed 
different  readings.  But  each  investigator  would  set  up  on«  and  r  uy 
one  likelihood  function.  This  is  a  function  of  a  single  sample^  th$ 
sample,  his  sample;  no  matter  how  independent,  complicated,  or 
interdependent  his  observations  may  be,  they  form  a  single  sample. 
The  maximum  likelihood  criterion  tells  us  to  proceed  as  if  Nature 
did  the  most  probable  thing.  We  assert  this  about  the  totality  of 
observations  in  the  sample  rather  than  about  any  single  observation. 

2.4.  The  form  of  the  likelihood  function 

Return  to  the  consumption  model  Ct  =  a  +  yZt  4*  t*i.  Tht3  fol* 
lowing  statement  must  be  accepted  on  faith  (its  proof  is  a  decpish 
theorem  in  analysis):  Under  the  assumptions  that  Nature  conforms 
to  the  model  and  that  the  true  values  of  the  parameters  are  a  and  yt 
the  probability  of  observing  the  particular  sample  Ci,  ft,  .  .  ,  ,  C&9 
Zi,  Z*,  .  .  .  ,  Z$  is  equal  to  the  probability  that  the  error  term  shall 
have  assumed  the  particular  values  u%,  u2,  .  .  .  ,  u$  multiplied  by  & 
factor  det  J. 

The  term  det  J  happens  to  be  equal  to  1  in  all  single-*-^ 
cases;  so  we  need  not  worry  about  it  yet.    It  becomes  important  }i 
two-or-morc-cquation  models* 

The  statement  cited  above  is  of  immense  and  curioM§  ^gniflcmnee^ 
We  observe  the  sample  C„  Z,.  But  we  cannot  know  (HfrvAlj  h®W 
probable  or  improbable  it  is  to  obtain  this  particular  Simple,  s\m<i 
all  our  stochastic  assumptions  have  to  do  with  the  probability  distribu- 
tion of  the  u's}  not  of  the  C's  and  Z's.  On  the  other  hand,  Wi  mm 
never  observe  the  random  errors  themselves.  So  one  might  despciir 
of  finding  the  probability  of  this  particular  sample  but  far  tfm  remark- 
able property  cited.  Let  L  stand  for  the  probability  ©f  the  sample 
and  q  for  the  probability  that  the  random  term  will  take  cm  the  values 
Wii  w2,  .  .  .  ,  Us.    Then  we  have 

L  =  det  J  •  q(ui,u2,  ...  yus)  (24) 

Now,  the  (unobservable)  w's  are  functions  of  the  (observed)  C's 
and  Z's  and  of  (the  unknown)  a  and  7,  because  the  model  implies 


30 


ESTIMATING   CRITERIA  AND  THE   METHOD   OP  LEAST  SQUARES 


Ut  =  Ct  —  a  —  yZt.  To  maximize  likelihood  is  to  seek  the  pair  of 
values  of  a  and  7  that  makes  L  as  large  as  possible. 

What  form  q(uh  .  .  .  ,us)  takes  depends  on  the  stochastic  assump- 
tions about  the  error  term. 

This  concludes  the  discussion  of  the  logic  behind  maximum  likelihood 
estimating  of  a  and  7. 

In  the  next  few  pages  I  discuss  the  mechanism  of  maximizing  L  under 
the  Six  Simplifying  Assumptions.  On  a  first  reading,  you  might  skip 
the  rest  of  this  section  without  serious  loss.  Readers  who  wish  to 
refresh  your  manipulative  skills,  read  on!  We  shall  now  omit  writing 
det  J  =  1,  since  we  are  discussing  only  single-equation  cases  at  this 
point. 

By  Simplifying  Assumption  4,  the  random  terms  uh  u%}  .  .  .  ,  us 
come  from  a  multivariate  normal  distribution.  Therefore  (2-2) 
applies,  and 

L  -  q(uhuh  .  .  .  ,us)  =  (2ir)sn  det  (<ruu)-*,  ^  40 

exp  [-y2(u  -  eu5Vuu)-l(u  -  en)]    (2-4) 

By  Simplifying  Assumption  2,  Zut  —  0.  Simplifying  Assumption  3 
states  that  all  diagonal  elements  of  o-uu  are  equal  to  a  finite  constant 
(Tuuy  and  Assumption  5  states  that  all  nondiagonal  elements  are  zero; 
so 

det  (cruu)->*  =  (*uu)-Sf* 


Therefore,  (2-4)  reduces  to 


L  -  (2ir)-s>*(vuu)-s'*  exp 


•1    1 
u 


and  finally  to 


a 

L  =  (2t)-.s/V««)-5"  exp  [-^((rWtt)-1  ^  w2.]  (2-5) 

•-1 

The  following  properties  of  L  will  not  be  proved: 

1.  L  is  a  continuous  function  with  respect  to  a,  7,  cr„„  except  at 
<r«tt  =  0.  This  means  that  it  can  be  differentiated  quite  safely.  As 
for  the  exception,  we  need  not  worry  about  it;  for  w,  as  a  random 
variable,  assumes  at  least  two  distinct  values  in  the  universe,  and 
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therefore  cUu  >  0.  If  the  sample  is  of  only  two  observations,  the  fit 
is  perfect;  muu  is  zero — but  in  that  case  we  do  not  use  the  likelihood 
approach  at  all.  We  just  solve  the  two  equations  C\  =»  a  +  yZi  and 
Ci  =  a  4"  7Z2  for  the  two  unknowns  a  and  7. 

2.  Setting  the  partial  derivatives  of  L  equal  to  zero  locates  its 
maxima.  It  has  no  minimum;  therefore,  we  do  not  need  to  worry 
about  second-order  conditions  of  maximization. 

3.  L  is  a  maximum  when  its  logarithm  is  a  maximum.  So,  instead 
of  (2-5),  we  maximize  the  more  convenient  expression 

log  L  m  -  -  log  2ir  -  g  log  <r«„  -  g  (*««)~l  \  u\         (^6) 

4.  The'true  values  of  a,  7,  and  <ruu  are  not  functions  of  one  another, 
but  constants.  Therefore,  in  maximizing,  all  partial  derivatives?  of  m, 
y,  and  <r  with  respect  to  one  another  are  zero. 

Maximizing  (2-6)  results  in 

V  (C.  -  a  -  yZ.)  -  0 
V  (C.  -  «  -  7^)2.  -  0'  (2.7) 

J  V  (C.  -  «  ~  7^.)2  -  *.. 

The  solution  of  (2-7)  for  a,  7,  o^m  gives  the  maximum  likelihood  esti- 
mate <$,  ^,  £. 

2.5.  Justification  of  the  least  squares  technique 

It  is  evident  that  (2-6)  gives  least  squares  estimates  for  a,  7,  a. 

System  (2-7)  says  that  the  maximum  likelihood  values  of  a  and  7 
are  the  values  that  minimize  the  sum  of  the  squares  of  the  residuals  #,. 
The  last  equation  in  (2-7)  states  that  the  maximum  likelihood  esti- 
mate &uu  of  the  true  variance  o-„„  is  the  average  square  residual. 

This,  then,  is  the  justification  for  minimizing  squares.  Remember 
that  to  get  this  result  we  had  to  make  use  of  a  great  many  assumptions 
both  about  the  model  itself  and  about  the  nature  of  its  error  term. 
If  any  one  of  these  many  assumptions  had  not  been  granted,  we 
might  not  have  reached  this  result.     Therefore,  one  should  not  go 
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about  minimizing  squares  too  lightheartedly.  For  every  different  set 
of  assumptions  a  certain  estimating  procedure  is  best,  and  least 
squares  is  best  only  with  a  proper  combination  of  assumptions.  Con- 
versely, every  estimating  procedure  contains  in  itself  (implicitly)  some 
assumptions  either  about  the  model,  or  about  the  distribution  of  w, 
or  both.1 

Digression  on  computational  arrangement 


uu 


It  pays  to  develop  a  tidy  scheme  for  computing  d,  1,  and  & 
because  computation  recipes  similar  to  (2-7)  turn  up  pretty  often. 

It  is  always  possible  to  arrange  the  computations  in  such  a 
way  as  to  estimate  the  coefficient  y  of  the  independent  variable 
first.  With  •?  in  hand,  one  computes  the  constant  term  d. 
Finally,  with  d  and  1,  one  computes  the  residuals  #„  and  from 
these  residuals,  an  estimate  of  <ruu. 

An  analogous  procedure  for  models  having  several  independent 
variables  (and,  hence,  several  7s)  is  developed  in  the  next 
digression. 

In  all  cases,  that  is  to  say,  for  simple  as  well  as  for  complicated 
models,  I  shall  describe  only  the  computational  steps  for  estimat- 
ing the  7s  (coefficients  of  the  independent  variables). 

Write  (2-7)  as  follows: 

aS  +  yXZ,  =  2CV 
aZZ,  +  yZZ\  =  2C,Z. 

where  the  sums  run  over  the  entire  sample.     Now  subtract  2Z. 
times  the  first  equation  from  S  times  the  second.     The  result  is 

7[£2Z2  -  (2Z)(2Z)]  -  [S2CZ  -  (2C)(2Z)]  (2-8) 

Note  that  we  have  eliminated  a  and  that,  moreover,  in  the 
square  brackets  we  may  recognize  the  familiar  moments,  defined  in 
Chap.  1.     Thus  (2-8)  is  equivalent  to 

1  The  Six  Simplifying  Assumptions  are  sufficient  but  not  necessary  conditions 
for  least  squares.  Least  squares  is  a  "best  linear  unbiased  estimator"  under 
much  simpler  conditions.  This,  however,  is  another  subject.  I  chose  these 
particular  six  assumptions  because  with  them  it  is  easy  to  show  how  a  stochastic 
specification  and  an  estimating  criterion  lead  to  a  specific  estimate  of  a  parameter 
rather  than  to  some  other  estimate. 
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yvftzz  =  m>cz 
and  the  estimate  of  7  can  be  expressed  very  simply  as 

<?  =  (mzz)-lmzc        or        ^£  (2-9) 

mzz 

which,  besides  being  compact,  generalizes  easily  to  N  dimensions, 
i.e.,  by  replacing  the  Greek  and  italic  letters  by  the  correspond- 
ing characters  in  boldface  type : 

?  -  (mZz)-1mzc        or        ^  (2-9a) 

mzz 

2.6.  Generalized  least  squares 

All  the  principles  discussed  so  far  apply  to  all  linear  models  consisting 
of  a  single  equation.  To  treat  the  general  case,  we  shall  make  a  slight 
change  in  notation:  y  will  stand  for  any  endogenous  variable  (the  role 
played  by  consumption  C  so  far)  and  z  for  any  exogenous  variable 
(the  role  played  by  lagged  income  Z  so  far). 

Let  us  suppose  that  the  endogenous  variable  y(t)  depends  on  // 
different  predetermined  variables  Z\(t),  22(0  >  .  .  .  ,  zn(t)  as  follows 
(omitting  time  indexes) : 

y  =  a  -f  7i2?i  -|-  72^2  +  ■•'••.+  ynZH  +  u  (2-10) 

Indeed,  the  analogy  of  (2-10)  with  C  =  a  -f-  yZ  +  u  is  so  perfect 

that  everything  said  about  the  latter  applies  to  the  shorthand  edition 

of  (2-10): 

y  =  a  +  yz  +  u 

where  y  is  the  vector  (71,72,  .  .  .  ,7/y)  and  z  is  the  vector1  (21,2s, 

.    •    •    ,2//). 

But  we  must  be  careful.  The  first  five  Simplifying  Assumptions 
(u  is  random,  has  mean  zero,  has  constant  variance,  is  normal,  and  is 
serially  independent)  need  no  alteration*  Assumption  6  must,  how- 
ever, be  changed  to  read  as  follows:  The  error  term  ut  is  fully  inde- 
pendent of  Zi(jt)t  22(0>  •  •  •  >  2#(0. 

Under  the  new  version  of  the  Simplifying  Assumptions,  the  maxi- 

1  For  typographical  simplicity  I  shall  not  bother,  in  obvious  cases,  to  dis- 
tinguish a  column  from  a  row  vector.    In  this  case  z  is  a  column  vector. 
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mum  likelihood  criterion  leads  to  the  estimate  of  71,  72,  ...  ,  in 
that  minimizes  the  sum  of  the  squared  residuals.  And,  moreover, 
these  estimates  are  given  by  the  formula 


t  -  (m,,)-^, 


(2-11) 


which  is  exactly  analogous  to  (2-9).     What  these  boldface  symbols 
mean  is  explained  in  the  next  digression. 

Digression  on  matrices  of  moments  and  their  determinants 

This  is  a  natural  place  to  introduce  some  extremely  convenient 
notation,  which  we  shall  be  using  from  Chap.  6  on. 

If  p,  q,  r,  x,  y  are  variables,  m^p,q,r).(XlV)  is  the  matrix  whose 
elements  are  moments  that  can  be  constructed  with  p,  q,  r  on  x 
and  y.  The  variables  in  the  first  parentheses  correspond  to  the 
rows  a^d  those  in  the  second  to  the  columns.     Thus, 


m(p.«.r)-(x.v) 


m, 


m. 


qx 

mrx 


Ttlpy 

mqy 

Wry  j 


The  middle  dot  in  the  subscript  may  be  omitted. 

Likewise,  m«  means  the  matrix  whose  elements  are  moments  of 
the  variables  zif  z2,  .  .  .  ,  zh  on  themselves: 


m««  = 


*i*i 


i*» 


Lm 


IB*** 


m 


*fl»t 


and  m™  means 


m 


*iV 


m 


*l*B 


m: 


*a*B* 


.mt*u- 
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Every  square  matrix  has  a  determinant.  So  does  every  square 
matrix  of  moments,  for  instance,  mM;  for  the  determinant  of  m 
we  write  det  m,  perhaps  with  the  appropriate  subscripts  dct  mn, 
or  det  m<Zl,*, *B)Uu*% **>• 

But  it  is  simpler  to  write  m„  instead  of  det  m  or  det  m„; 
and  we  shall  do  this  for  compactness.     The  lightface  italic  m  in         \ 
the  expression  mzz  indicates  that  the  determinant  is  a  simple 
number,  like  2  or  16.17,  and  neither  a  vector  nor  a  matrix  of 
numbers  (these  are  printed  bold). 

One  way  to  estimate  the  coefficients  71,  72,  .  .  .  ,  7#  is  to 
perform  the  matrix  operations  given  in  (2-11).  Another  way 
is  by  Cramer's  rule}  which  calculates  various  determinants  and 
computes 

71  ■= 

™(fn*i *m)(*i. » *h) 

A  rtl(tx,y,  ...  .      t„)(tl,tt f//) 

72  = 

W(Zl>f|(  .  ,  .  ,  ftf)(*,,*, gB) 

(2-12) 


*       __     Wjtu't l/)(gl.«| *b) 

iff  yy. 

"*(»li*| *b)Ui>** *b) 

Both  these  ways  are  very  cumbersome  in  practice  for  equations 
with  more  than  three  or  four  variables,  unless  we  have  ready 
programs  on  electronic  computers.  Appendix  B  gives  a  stepwise 
technique  for  calculating  ^1,  ^2,  .  .  .  ,  1h  that  can  be  used  on 
an  ordinary  desk  calculator. 
Matrix  inversion  is  discussed  in  Appendix  A. 

2.7.  The  meaning  of  unbiasedncss 

Let  us  discuss  bias  and  unbiasedncss  by  using  the  original  model  of 
consumption  Ct  =  a  +  yZt  -f-  u>t  with  the  understanding  that  all 
conclusions  hold  true  for  the  generalized  single-equation  model 
y  =  a  +  71Z1  +  *  *  '  +  IhZh  +  u.  Furthermore,  we  can  restrict  our- 
selves, with  some  exceptions,  to  the  discussion  of  7,  because  the 
statements  to  be  made  are  also  true  of  a. 
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Imagine  that  we  obtain  our  guess  ^  of  the  parameter  7,  violating 
none  of  the  Simplifying  Assumptions.  The  guess  so  chosen  is  the 
most  likely  in  the  circumstances.  But  this  does  not  guarantee  it  to 
be  equal  to  the  true  value  7.  This  is  so  because  the  observations 
C„  Z,  we  have  to  go  on  are  just  a  sample.  And  in  sampling  anything 
can  happen.  Extremely  atypical  misleading  samples  are  improbable 
but  perfectly  possible.  So  it  makes  sense  to  ask  how  far  off  the  guess 
i  is  likely  tojbe  from  the  true  value  7. 

Here  it  is  very  important  to  distinguish  between  (1)  taking  again 
and_agcoin  a  sample  of  size  S,  (2)  taking  bigger  and  bigger. samples 
(one  of  each  size) .  The  first  procedure  is  connected  with  the  important 
statistical  notion  of  bias,  the  second  with  that  of  consistency.  Both 
procedures  are  ideal  and  impractical,  because  such  samples  must  be 
taken  from  the  universe  (level  III)  and  not  merely  from  the  population 
(level  II).  Therefore,  even  with  infinite  resources  and  infinite 
patience,  the  concepts  are  not  operational. 

Consider  any  estimating  recipe  (say,  least  squares).  Choose  a 
sample  size,  say,  S  =  20;  draw  (from  the  universe)  all  possible  samples 
of  size  20;  for  each  sample  compute  (by  least  squares)  the  correspond- 
ing <$;  then  average  out  these  fs.  If  the  average  'f  equals  the  true  7, 
then  we  say  that  the  procedure  of  least  squares  is  an  unbiased  method 
for  estimating  7,  or  an  unbiased  estimator  of  7,  for  sample  size  S  —  20. 
Loosely,  we  might  sajr  that  on  the  average,  least  squares  gives  a  correct 
estimate  of  7  from  samples  of  20  observations. 

In  order  to  pin  down  firmly  the  concept  of  bias,  I  have  constructed 
a  purposely  simple  and  exaggerated  example.  It  involves  just  three 
time  periods,  very  uneven  disturbances  from  time  period  to  time 
period,  and  a  random  disturbance  that  assumes  just  three  different 
values.  Yet  this  example  illustrates  all  that  could  be  shown  with  a 
larger  and  more  realistic  one. 

Assume  that  the  true  values  of  the  parameters  we  seek  to  estimate 
are  a  =  4,  7  =  0.4.  Assume  that  the  population  consists  of  exactly 
3  elements,  labeled  a,  b,  c,  whose  coordinates  are  given  in  Table  2.1 
below;  the  three  points  are  shown  in  Fig.  2.  This  population  could 
have  come  from  an  infinite  universe*  but  let  us  (for  pedagogic  reasons) 
deal  with  a  finite  universe  that  consists  of  the  above  three  points 
a,  b,  c  plus  four  more,  which  are  named  a',  b'y  c',  and  c".  Every  point 
of  the  universe  is  completely  defined  when  we  specify  the  random 
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©c'  and  C" 


True  (exact)  relation 
Cj-4+0.4Zt 


Zt 


10  15  20 

Income 
Fig.  2.  A  seven-point  universe.    Solid  dots:  points  in  the  population.    Hollow 
dots:  points  in  the  universe  but  not  in  the  population. 

Table  2.1 
The  population  (a,b,c) 


Point 

Time 
V 

Zv 

cv 

up 

a 

1 

0 

4.05 

0.05 

6 

2 

4 

0 

0.4 

c 

3 

14 

0.6 

-9.0 

error  u  that  corresponds  to  it  and  the  level  of  the  independent  variable. 
These  are  given  in  Table  2.2. 

Table  2.2 


Time 

Points 

IN  THE  POPULATION 

Points  in  the  universe 
but  not  in  the  population 

Name 

Value 
of  ut 

Value 

OF  Zt 

Name 

Value 

OF  Ut 

ValUH 

OF  Zt 

1 

a 

0.05 

0 

a' 

-0.05 

0 

2 

b 

0.40 

4 

b' 

-0.40 

4 

3 

c 

-9.00 

14 

c' 

+4.50 

14 

3 

... 



C" 

+4.50 

14 

Exercise 

2.E    Which  Simplifying  Assumptions  are  fulfilled  by  ut  in  the  uni- 
verse of  Table  2.2,  and  which  are  violated? 
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Now  let  us  see  if  the  least  squares  method  is  an  unbiased  estimator 
of  7.  First  let  us  take  all  conceivable  samples  of  size  2  and  for  each 
compute  the  least  squares  value  *?»  Samples  should  be  taken  in  such  a 
way  that  the  same  time  period  is  not  represented  more  than  once. 

The  population  can  yield  only  the  following  pairs:  (a,b),  (a,c),  and 
(&,c).  This  is  the  most  that  a  flesh-and-blood  statistician,  even  one 
equipped  with  unlimited  means,  could  obtain  operationally,  because 
points  a',  6',  c',  and  c"  exist,  so  to  speak,  only  in  the  mind  of  God. 
But  the  definition  of  bias  requires  us  to  check  samples  (of  size  2)  that 
include  all  points  of  the  universe,  human  and  divine  alike.  There  are 
sixteen  such  samples,  and  the  corresponding  estimates  &  and  1  are 
given  in  Table  2.3  and  plotted  in  Fig.  3.  When  all  sixteen  are  con- 
sidered, it  is  seen  that  least  squares  is  an  unbiased  estimator  of  7 
(and  of  a). 

Table  2.3 
Estimates  of  a  and  y  from  samples  of  size  2 


Points  in  the  sample 

CORRESPONDING  ESTIMATE 

OP  7 

OF  a 

>a  b 

0.4875 

4.0500 

a  V 

0.2875 

4.0500 

■■"  a  c 

-0.2464 

4.0498. 

a  e 

0.7179 

4.0497 

a  c" 

0.7179 

4.0497 

a'b 

0.5125 

3.9500 

a'b' 

0.3125 

3.9500 

o'c 

-0.2393 

3.9501 

a'  c' 

0.7250 

3.9500 

a'  c" 

0.7250 

3.9500 

.6  c 

-0.5400 

8. 1600 < 

b  c' 

0.8100 

2.7000 

b  c" 

0.8100 

2.7G00 

b'  c 

-0.4G00 

7.0400 

V  C' 

0.8000 

1.6400 

b'  c" 

0.8900 

1.6400 

Average  of  all  conceivable  samples 

57  -      0.4000 

e&  -  4.0000 

Average  of  all  feasible  samples 

(un- 

primed  points) 

-0.0997 

5.4199 

If  we  try  all  samples  of  size  3,  we  get  the  results  tabulated  in  Table  2.4 
and  plotted  in  Fig.  4. 
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Table  2.4 
Estimates  of  a  and  y  from  samples  of  size  3 


Points  in 

* 

THE  SAMPLE 

a  b  c 

5.36728 

-0.30288 

a  b  c' 

3.63652 

0.73558 

a  b  c" 

3.63652 

0.73558 

a  b' c 

5.00833 

-0.28750 

a  b'c' 

3.27757 

0.75096 

a  b'  c" 

3.27757 

0.75096 

a'bc 

5.29939 

-0.29712 

a'b  c' 

3.56857 

0.74135 

a'  b  c" 

3.56857 

0.74135 

a' b'c 

4.94038 

-0.28173 

a'b   c' 

3.20962 

0.75673 

a'  b'  c" 

3.20962 

0.75673 

Average 

e&  -  4.0000 

e  -   0.4000 

For  a  sample  of  size  3,  the  least  squares  method  is  an  unbiased  esti- 
mator of  both  7  and  a. 

In  certain  cases,  not  illustrated  by  our  simple  example,  (1)  an 
estimating  technique  (say,  least  squares)  may  be  unbiased  for  gome; 
sample  sizes  and  biased  for  other  sizes;  (2)  a  method  may  overestimate 
7  for  certain  sample  sizes  and  underestimate  it  for  others,  on  th$ 
average;  (3)  we  may  be  able  to  tell  a  priori,  knowing  the  sample  size S, 
whether  the  bias  is  positive  or  negative  (in  other  cases  we  cannot)  | 
(4)  a  method  may  be  unbiased  for  one  parameter  but  biased  for  another. 

2.8.  Variance  of  the  estimate 


In  Fig.  3  I  have  plotted  all  the  estimates  of  a  and  7  for -all  posalbl© 
samples  of  size  2.  The  same  thing  was  done  for  size  B  in  Fig.  4*  In 
general,  the  estimates  are  scattered  or  clustered,  depending  (I)  on  the 
size  S  of  the  sample,  (2)  on  the  size  and  other  features  of  the  universe, 
(3)  on  the  particular  estimating  technique  we  have  adopted,  and 
ultimately  (4)  on  the  extent  to  which  random  effects  dominate  the 
systematic  variables.  Other  things  being  equal,  we  prefer  an  esti- 
mating technique  that  yields  clustered  estimates.  The  spread  among 
the  various  estimates  $  is  called  the  variance  of  the  estimate  •?,  and  is 
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written  <rtf  or  <r(1,1),  or,  sometimes,  <r(W\S)  if  we  want  to  emphasize 
what  size  sample  it  relates  to. 
The  variance  is  defined  by 

and  is  a  constant,  which  exists  and  can  bo  computed  if  the  four  items 
listed  above  arc  known.    Table  2.5  gives  the  values  of  <r  for  our 
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Fig.  4.  Parameter  estimates  from  all 
samples  of  size  3.    <f>:  double  point. 


seven-point  example.  Note  the  interesting  (and  counterintuitive) 
fact  that  the  variance  of  the  estimate  can  increase  as  the  sample  size 
increases!     This  quirk  arises  bec^usej^jn^Jhe  example,  the  random 

disturbance  has  a  skew  distribution. If  u  is  symmetrical,  the  variance 

of  the  estimate_decreases  as  thej sarnp_le_ size increases. 

Table  2.5 

Sue  S  op  samples  <r($,7) 


0.2325 
0.2397 
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2.9.  Estimates  of  the  variance  of  the  estimate 

If  we  have  complete  knowledge,  we  can  compute  the  true  value  of 
(r(i rf\S)  by  making  a  complete  list  of  all  samples  of  size  S,  computing 
all  possible  estimates  of  7,  and  finding  their  variance,  as  I  did  in  the 
above  example.  In  practice,  however,  it  is  impossible  to  exhaust  all 
samples  of  a  given  size,  because  the  universe  contains  points  that  are 
not  in  the  population.  So,  instead,  we  must  be  content  with  gue&ding 
at  the  variance  of  the  estimate  by  the  use  of  whatever  information  is 
contained  in  the  single  sample  we  have  already  drawn. 

At  first,  you  might  suppose  that  estimating  <r($rt\S)  is  logically 
impossible  when  you  have  a  single  sample  of  size  S  to  work  with, 
because,  after  all,  the  variance  of  the  estimate  of  7  represents  what 
happens  to  7  as  you  take  all  samples  of  size  S. 

All  is  not  lost,  however,  because  a  single  sample  of  size  S  contains 
several  samples  (S  of  them)  each  of  size  <S-minus-l.  The  latter  we  can 
generate  by  leaving  out,  one  at  a  time,  each  observation  of  the  original 
sample.  Thus,  if  the  original  sample  is  (a,b,c)  of  size  S  —  3,  it  con- 
tains three  subsamples  of  size  2  each,  the  following  ones:  (a,6),  (o,c), 
and  (6,c),  which  yield,  respectively,  the  three  estimates  f  (a, 6),  1(a,c), 
and  f(b,c).  We  get,  then,  some  idea  about  variations  in  the  estimate 
of  7  among  samples  of  size  2.  Still,  we  know  nothing  about  the  variance 
of  7  as  estimated  from  samples  of  size  8.  Here  we  invoke  the  maximum 
likelihood  criterion.  The  original  sample  (a,6,c)  was  assumed  to  be 
the  most  probable  of  its  kind,  namely,  the  family  of  samples  containing 
three  observations  each.  If  this  is  so,  then  observations  a,  6,  c 
generate  the  most  probable  triplet  T  =  {(a,6),(a,c),(6,c)j  of  samples 
containing  two  observations  each.  Therefore,  the  variability  of  i 
(in  the  triplet  T)  estimates  its  variability  in  samples  of  size  3. 

From  Table  2.3, 

jfafi)  =      0.4875 

1(a,c)  =  -0.2464 

?(&,c)  =  -0.5400 

Average  =  -0.0997 

The  variance  of  f  in  the  sample  triplet  is  equal  to 

HK0.4875  4-  0.0997)2  +  (-0.2464  +  0.0997)2 

+  (-0.5400  +  0.0997) *]  -  0.1867 
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The  last  figure  must  now  be  corrected  by  the  factor  £-minus-l  «■  2  if  it 
is  to  be  an  unbiased  estimate  of  the  variance  of  $ (a,6,c).  Estimates  of 
variance  based  on  averages,  if  uncorrected,  naturally  understate  the 
variance.    The  proof  that 

&  -  0.1867  X  2  -  0.3734 

is  an  unbiased  estimate  of  o-(f,f|3)  is  in  Appendix  C. 

In  practice  we  are  too  lazy  to  estimate  y  again  and  again  for  all 
the  subsamples.  The  formula  cr(f,f \S)  .=  m^/(S  —  l)mzz  gives  a 
short-cut  (and  biased)  estimate  of  the  variance  of  1  for  samples  of  the 
original  size.  Table  2.6  lists  these  estimates  for  three-point  samples 
and  repeats  some  of  the  information  from  Table  2.4. 

Tabic  2.6 
Estimates  of  7  and  of  the  variance  of  its  estimates 


Points  in 

7 

1(»    M  m            ^ 

THE  SAMPLE 

J("')u,(S-l)m^ 

a  b  c 

-0.30288 

0.02605 

a  b  c' 

0.73558 

0.00256 

a  b   c" 

0.73558 

0.00256 

a  b'  c 

-0.28750 

0.01377 

a  b'c' 

0.75096 

0.00895 

a  V  c" 

0.75096 

0.00895 

a'bc 

-0.29712 

0.02731 

a'b  c' 

0.74135 

0.00217 

a'  b  c" 

0.74135 

0.00217 

a'b'c 

-0.28173 

0.01377 

a'  b'  c' 

0.75673 

0.00822 

a'  b'  c" 

0.75073 

0.00822 

Table  2.6  must  be  interpreted  carefully.  To  begin  with,  the  investi- 
gator will  usually  know  only  its  first  line,  because  he  has  a  single 
sample  to  work  with.  The  remaining  lines  are  put  in  Table  2.6,  for 
pedagogic  reasons,  by  the  omniscient  being  who  can  consider  all 
possible  worlds.  Events,  could  have  followed  one  or  another  course 
(and  only  one)  among  the  courses  listed  in  the  several  lines  of  Table  2.6. 
It  just  happened  that  (a,6,c)  materialized  and  not  some  other  triplet. 
It  yielded  the  two  estimates  -f  =  —0.30288,  a  very  wrong  estimate,  and 
ff  =  0.02C05.  The  latter  misleads  us  to  believe  in  the  likelihood  of  the 
former. 
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If  sample  (a',h',c')  had  materialized,  the  two  guesses  would  have 
been  1  =  0.73558  (not  so  bad  as  before)  and  5  =  0.00256,  which  is 
ten  times  as  "confident"  as  before.  It  is  entirely  possible  for  a 
sample  to  give  a  very  wrong  parameter  estimate  with  a  great  deal  of 
confidence.  The  mere  fact  that  5(1,1)  is  small  does  not  make  1  a 
good  guess. 

It  is  comforting,  of  course,  to  have  some  measure  of  how  much  1 
varies  from  sample  to  sample.  What  is  upsetting  is  that  the  measure 
is  itself  a  guess.  True,  it  is  better  than  nothing,  but  this  is  no  con- 
solation if  by  some  quirk  of  fate  we  have  picked  a  sample  so  atypical 
that  it  gives  us  not  only  a  really  wrong  parameter  estimate  1 ,  but  also 
a  really  small  5(1,1).  The  moral  is:  Don't  be  cocksure  about  the 
excellence  of  your  guess  of  y  just  because  you  have  guessed  that  its 
variance  v(1,1)  is  small. 

2.10.  Estimates  ad  nauseam 

Note  carefully  now  that,  whereas  cr(1,1)  is  a  constant,  5(1,1)  is  not, 
but  varies  with  each  sample  of  the  given  sizeJ  Therefore  5(1,1)  itself 
has  a  variance,  which  we  may  denote  by  c(5(1,1))\  this  is  a  true 
constant.  Now  there  is  nothing  to  prevent  us  from  making  a  guess  at 
the  latter  on  the  basis  of  our  sample,  and  this  guess  would  be  symbolized 
by  5(5(1,1)),  which  is  no  longer  a  constant  but  varies  with  each  sample, 
and  so  has  a  true  variance  a(5(5(1,1))) — and  so  on,  ad  infinitum.  In 
other  words,  we  cannot  get  away  from  the  fact  that,  if  all  we  can  do 
about  7  is  to  guess  that  it  equals  1 ,  then  all  we  can  do  about  its  variance 
c(1,1)  is  to  guess  it  too;  likewise  all  we  can  do  about  this  last  guess  is  to 
guess  again  about  its  true  variance,  and  so  on  forever.  Guess  we 
must,  stage  after  stage,  unless  we  have  some  outside  knowledge. 
Only  with  outside  knowledge  can  the  guessing  game  stop.  The  game 
is  rarely  played,  however,  beyond  5(1,1),  (1)  because  it  is  quite  tedious, 
and  (2)  because  large  enough  samples  give  good  1a  and  as  with  high 
probability. 

2.11.  The  meaning  of  consistency 

As  in  our  explanation  of  unbiasedness,  let  us  discuss  the  parameter 
7  of  the  model  Ct  =  a  +  yZt  +  ut  with  the  understanding  that  all 
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conclusions  generalize  to  all  the  parameters  in  the  model 
y  =  a  +  7i2i  -) h  ynzn  +  u 

Consider  any  estimating  recipe,  say,  least  squares  or  least  cubes. 
Choose  a  sample  of  a  given  size,  say,  S  =  20,  and  compute  *?.  Then 
choose  another  sample  containing  one  more  observation  (S  =  21) 
and  compute  its  ^.  Keep  doing  this,  always  increasing  the  sample's 
size.  The  bigger  samples  do  not  have  to  include  any  elements  of  the 
smaller  samples — though  this  becomes  inevitable  as  the  big  samples 
grow,  if  the  universe  is  finite.1  If,  as  the  size  of  the  sample  grows,  the 
estimates  f  improve,  then  we  say  that  the  least  squares  procedure  is  a 
consistent  estimator  of  7.  Note  that  1  does  not  have  to  improve  in 
each  and  every  step  of  this  process  of  increasing  the  size  of  the  sample. 

Improvement  in  the  above  paragraph  means  that  the  probability 
distributions  of  1(S),  1(S  -f  1),  .  .  .  become  more  and  more  pinched 
as  the}r  straddle  the  true  value  of  the  parameter. 

Digression  on  notation 

There  are  two  variant  notations  for  consistency.    Let  y(s)  be 
the  consistent  estimator  from  a  sample  of  s  observations.     Let  e 
and  Tj  be  two  positive  numbers,  however  small.     Then  there  is,, 
some  size  S  for  which 

P(\y(s)  -  7I  <  0  >  1  -  v 

if  s  >  S.     A  shorthand  notation  for  the  same  thing  is 

P  lim  y(s)  =  7 

which  is  to  be  read  "y(s)  converges  to  7  in  probability/'  or 
"probability  limit  of  y(s)  is  7." 

Under  very  weak  restrictions,  a  maximum  likelihood  estimate  is  also 
a  consistent  estimate.  Note,  however,  that,  even  when  the  method  is 
consistent,  there  is  no  guarantee  that  the  estimate  will  improve  every 
time  we  take  a  larger  sample.  It  may  turn  out  that  our  sample  of 
size  2  happens  to  contain  points  a  and  b,  which  give  an  estimate 
f(2)  =  0.4875,  and  the  next  larger  sample  happens  to  contain  points 

1  A  sample  could,  of  course,  bo  infinite  without  ever  including  all  the  elements 
of  an  infinite  universe. 


2.12.   THE   MERITS  OP  UNBIASEDNESS  AND  CONSISTENCY  45 

o,  6,  and  c,  which  give  an  estimate  7(3)  »  —0.30288,  which  is  much 
worse.  Even  when  the  larger  sample  includes  all  the  points  of  the 
smaller,  as  in  the  example  just  cited,  it  can  give  a  worse  estimate. 
This  is  so  because  the  next  point  drawn,  c,  may  be  so  atypical  as  to 
outweigh  the  previous  typical  points  a  and  b. 

2.12.  The  merits  of  unbiasedness  and  consistency 

Are  the  properties  of  unbiasedness  and  consistency  worth  the  fuss? 
Remember  the  fundamental  fact  that  with  limited  sampling  resources 
it  is  not  possible  to  estimate  y  correctly  every  time,  even  when  the 
estimating  procedure  is  unbiased  and  consistent. 

Because  of  a  small  budget,  our  sample  may  be  so  small  that  ^  has  a 
large  variance.  Even  if  the  sample  is  large,  it  may  be  an  unlucky  one, 
yielding  an  extremely  wrong  estimate.  The  mistake  has  happened, 
and  it  is  no  consolation  to  know  that,  if  we  had  taken  all  possible 
samples  of  that  size,  we  would  have  hit  the  correct  estimate  on  the 
average.  The  following  complaint  is  a  familiar  one  from  the  area  of 
Uncertainty  Economics:  Some  people  advise  me  to  behave  always  so  as 
to  maximize  my  expected  utility;  in  other  words,  to  make  once-in-a- 
lifetime  decisions  as  if  I  had  an  eternity  to  repeat  the  experiment. 
Well,  if  I  get  my  head  chopped  ofT  on  the  first  (and  necessarily  final) 
try,  what  do  I  care  about  the  theoretical  average  consequences  of  my 
decision?  Wherever  a  comparatively  crucial  outcome  hinges  on  a 
single  correct  estimate,  unbiasedness  is  not  in  itself  a  desirable  property. 

Likewise,  it  is  mockery  to  tell  an  unsuccessful  econometrician  that 
he  could  have  improved  his  estimate  if  he  had  been  willing  to  enlarge 
his  sample  indefinitely. 

What,  then,  is  the  use  of  unbiasedness  and  consistency?  In  them- 
selves they  are  of  no  use;  they  do  help,  however,  in  the  design  of 
samples  and  as  rules  for  research  strategy  and  communication  among 
investigators. 

There  is  a  body  of  statistical  theory — not  discussed  in  this  work — 
which  tells  us  how  to  redesign  our  sample  in  order  to  decrease  bias  and 
inconsistency  to  some  tolerable  level.  For  example,  with  infinite 
universe,  if  we  have  two  parameters  to  estimate,  the  theory  says  that 
a  sample  must  be  larger  than  100  if  consistency  is  to  become  "  effective 
at  the  5  per  cent  level."     Whether  we  want  to  take  a  sample  that  large 
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depends  on  the  use  and  strategic  importance  of  our  estimate  as  well  as 
on  the  cost  of  sampling.  All  this  opens  up  the  fields  of  verification  and 
statistical  decision,  into  which  we  shall  not  go  here. 

Unbiasedness,  consistency,  and  other  estimating  criteria  to  be 
introduced  below  are  sometimes  conceived  of  as  scientific  conventions:1 

If  content  to  look  at  the  procedure  of  point  estimation  unpretentiously  as  a 
social  undertaking,  we  may  therefore  state  our  criterion  of  preference  for  a 
method  of  agreement  so  conceived  in  the  following  terms: 

(i)  different  observers  make  at  different  times  observations  of  one  and  the 

same  thing  by  one  and  the  same  method; 
(ii)  individual  seta  of  observations  so  conceived  are  independent  samples 
of  possible  observations  consistent  with  a  framework  of  competence,  and  as 
such  wo  may  tentatively  conceptualise  the  performance  of  successive 
sets  as  a  stochastic  process; 

(iii)  we  shall  then  prefer  any  method  of  combining  constituents  of  observa- 
tions, if  it  is  such  as  to  ensure  a  higher  probability  of  agreement 
between  successive  sets,  as  the  size  of  the  sample  enlarges  in  accordance 
with  the  assumption  that  we  should  thereby  reach  the  true  value  of  the 
unknown  quantity  in  the  limit; 

(iv)  for  a  given  sample  size,  we  shall  also  prefer  a  method  of  combination 
which  guarantees  minimum  dispersion  of  Values  obtainable  by  different 
observers  within  the  framework  of  (i)  above. 

In  the  long  run,  the  convention  last  stated  guarantees  that  there  will  be  a 
minimum  of  disagreement  between  the  observations  of  different  observers,  if 
they  all  pursue  the  same  rule  consistently.  .  .  .  We  have  undertaken  to 
operate  within  a  fixed  framework  of  repetition.  This  is  an  assumption  which 
is  intelligible  in  the  domain  of  surveying,  of  astronomy  or  of  experimental 
physics.  How  far  it  is  meaningful  in  the  domain  of  biology  and  whether  it  is 
ever  meaningful  in  the  domain  of  the  social  sciences  are  questions  which  we 
cannot  lightly  dismiss  by  the  emotive  appeal  of  the  success  or  usefulness  of 
statistical  methods  in  the  observatory,  in  the  physical  laboratory  and  in  the 
Cartographer's  office. 

Philosophers  of  probability  are  still  debating  whether  the  italics  of 
the  quotation  do  in  fact  define  a  universe  of  sampling,  whether  it  can 
be  defined  apart  from  the  postulate  that  an  Urn  of  Nature  underlies 
everything,  and  whether  the  above  scientific  conventions  become 
reasonable  only  upon  our  conceding  the  postulate. 

1  Lancelot  Hogbcn,  Statistical  Theory,  pp.  1106-207  (London:  George  Allen  & 
Unwin,  Ltd.,  1957).     Italics  added. 
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2.13.  Other  estimating  criteria 

So  far  I  have  mentioned  three  estimating  criteria,  or  properties  that 
we  might  desire  our  estimating  procedures  to  have.  These  were 
(1)  maximum  likelihood,  (2)  unbiasedness,  (3)  consistency.  Some 
others  are: 

4.  Efficiency 

If  y  and  1  are  two  estimators  from  a  sample  of  S  observations,  the 
more  efficient  one  has  the  smaller  variance.  It  is  possible  to  have 
^(7>7)  <  <r(1rt)  f°r  some  sample  sizes  and  the  reverse  for  other  sample 
sizes;  or  one  may  be  uniformly  more  efficient  than  the  other;  some 
estimators  are  most  efficient,  others  uniformly  most  efficient. 

5.  Sufficiency 

An  estimator  from  a  sample  of  size  S  is  sufficient  if  no  other  estimator 
from  the  same  sample  can  add  any  knowledge  about  the  parameter 
being  estimated.  For  instance,  to  estimate  the  population  mean,  the 
sample  mean  is  sufficient  and  the  sample  median  is  not. 

6.  The  following  desirable  property  has  no  name.  Let  o($tf\S) 
shrink  more  rapidly  than  <r(y,y\S)  as  the  sample  increases.  Then  i 
is  more  desirable  than  y. 

There  is  no  end  to  the  criteria  one  might  invent.  Nor  are  the  criteria 
listed  mutually  exclusive.  Indeed,  a  maximum  likelihood  estimator 
tends  to  the  normal  distribution  as  the  sample  increases;  it  is  consistent 
and  most  efficient  for  large  samples.  A  maximum  likelihood  estimator 
from  a  single-peaked,  symmetrically  distributed  universe  is  unbiased. 

2.14.  Least  squares  and  the  criteria 

If  all  the  Simplifying  Assumptions  are  satisfied,  the  least  squares 
method  of  estimating  a  and  y  in  single-equation  models  of  the  form 

Ct  -  a  +  yZt  +  ut  (2-13) 

yields  maximum  likelihood,  unbiased,  consistent,  efficient,  and  suf- 
ficient estimates  of  the  parameters.    This  result  can  be  generalized 
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in  a  variety  of  directions.  The  first  generalization  is  that  it  applies  to 
a  model  of  the  form 

y(t)  -  «  -f  7i*i(0  -f  y%z%{t)  +  •  •  •  4-  -hMD  +  u(t)    (2-14) 

where  y  l§  the  endogenous  variable  and  the  z'a  are  exogenous  variables. 
(Least  squares  is  biased  if  some  of  the  z's  are  lagged  values  of  y— this 
question  is  postponed  to  the  r  ext  chapter.) 

Lea^t  squares  yields  maximum  likelihood,  unbiased,  consistent, 
sufficient,  but  inefficient  estimates  if  the  variance  of  ut  is  not  constant 
but  varies  systematically,  either  with  time  or  with  the  magnitude  of 
the  exogenous  variables.  Such  systematic  variation  of  its  variance 
makes  u  heteroskedastic. 


2.15.  Treatment  of  he teroskedasticity 

We  shall  confine  the  discussion  of  heteroskedasticity  to  model 
(2-13)  on  the  understanding  that  it  generalizes  to  (2-14). 

The  random  term  can  have  a  variable  variance  <rUu(t)  for  various 
reasons: 

1.  People  learn,  and  so  their  errors  of  behavior  become  absolutely 
smaller  with  time.     In  this  case  a(t)  decreases. 

2.  Income  grows,  and  people  now  barely  discern  dollars  whereas 
previously  they  discerned  dimes.     Here  a{t)  grows. 

3.  As  income  grows,  errors  of  measurement  also  grow,  because  now 
the  tax  returns,  etc.,  from  which  C  and  Z  are  measured  no  longer 
report  pennies.     Here  c{t)  increases. 

4.  Data-collecting  techniques  improve.     v(t)  decreases. 
Consider  Fig.  5.     It  shows  a  sample  of  three  points  coming  from  a 

heteroskedastic  universe. 

Since  the  errors  are  heteroskedastic,  we  would,  on  the  average, 
expect  observations  in  range  1  to  fall  rather  near  the  true  regression 
line,  observations  in  range  2  somewhat  farther,  and  in  range  3  farther 
still.  In  any  given  sample,  say,  (a,b,c),  points  b  and  c  should  ideally 
be  "discounted"  according  to  the  greater  variances  that  prevail  in 
their  ranges.  Using  the  straight  sum  of  squares  is  the  same  as  failing 
to  discount  b  and  c.  The  result  is  that  sample  (a,b,c)  gives  a  larger 
value  for  7  than  it  would  if  observations  had  been  properly  discounted. 

If  no  allowance  is  made  for  the  changing  variance  <r(t),  least  squares 
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fits  are  maximum  likelihood,  unbiased,  and  consistent  but  inefficient. 
To  show  inefficiency,  consider  the  likelihood  function  of  (2*4),  There, 
the  matrix  4  of  the  covariances  of  the  random  term  not  only  was 
diagonal  but  had  equal  entries;  so  it  could  factor  out  [lit  (i^)]  and 
drop  out  when  the  likelihood  function  was  maximized  with  respect  to 
7  (and  a).  It  is  this  fact  that  made  i  an  efficient  estliimta,  With 
unequal  entries  along  the  diagonal,  this  is  no  longer  possible,  To 
obtain  an  efficient,  unbiased,  and  consistent  estimate  of  y,  we  must 
solve  a  complicated  set  of  equations  involving  7,  <r(l),  *  ■  ,  ,  *(&)% 
Somewhat  less  efficient  (but  more  so  than  minimizing  Su2)  is,  to  make 
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Fig.  5.  A  typical  sample  from  a  heteroskedastio  univerli. 

(from  outside  knowledge)  approximate  guesses  about  cr(l),  *  »  *  ,  9 (&) 
and  to  minimize  the  sum  of  squares  of  appropriately  "  deflated  *# 
residuals  (see  Exercise  2.G).  This,  too,  is  an  unbiased  and  consistent 
estimate. 

Exercises 

2.F  Prove  that  *?  =  mzc/mzz  is  unbiased  and  consistent  even 
when  u  is  heteroskedastio. 

2.G  Let  <f>(s)  be  an  estimate  (from  outside  information)  of  l/<r««(s). 
Prove  that  minimizing  2<t>(s)u2(s)  yields  the  following  estimate  of  7:  4 

=  (S(0)S0CZ)  -  (2<}>Z)(2<f>C) 
7W  (2<t>)(2ct>Z2)  -  (2tf>Z)2 


2.H    Prove  the  unbiasedness  and  consistency  of  i (<£). 
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Digression  on  arbitrary  weights 

The  weights  <f>(s)  are  arbitrary.  Is  there  no  danger  that  the 
denominator  of  ^(<£)  might  be  (nearly  or  exactly)  zero  and  blow 
up  the  proceedings? 

Answer:  There  is  none. 

Proof 

-     £  <fc<fc(Z<  -  Zy)2  >  0 

It  is  perfectly  proper  to  deflate  the  heteroskedastic  residuals  by  the 
exogenous  variable  Z  itself  and  to  fit  by  least  squares  the  homoskedastic 
equation 

§  =  «;~  +  Y  +  !  (2-15) 

instead  of  the  original  heteroskedastic  one 

C  =  a  +  yZ  +  u  (2-16) 

From  (2-15)  and  (2-16)  we  obtain  numerically  different  consistent 
and  unbiased  estimates  of  a  and  y. 

Exercise 

2.1  Prove  that  d(Z)  =  m(c/z)(i/z)/m(i/z)(i/z)  is  unbiased  and  con- 
sistent. 

Further  readings 

Maurice  G.  Kendall,  "On  the  Method  of  Maximum  Likelihood"  (Journal 
of  Vie  Royal  Statistical  Society,  vol.  103,  pt.  3,  pp.  389-399,  1940)  discusses 
the  reasonableness  of  the  method  and  the  concept  of  likelihood.  Whether 
the  principle  of  maximum  likelihood  is  logically  wound  up  with  subjective 
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belief  or  inverse  probability  is  still  under  debate.  The  intrepid  reader  who 
leafs  through  the  last  30  or  so  years  of  the  above  Journal  will  be  rewarded 
with. the  spectacle  of  a  battle  of  nimble  giants:  Bartlett,  Fisher,  Gini,  Jeffreys, 
Kendall,  Keynes,  Pearson,  Yule. 

The  algebra  of  moments  is  a  special  application  of  matrices  and  vectors. 
Matrices  and  determinants  are  explained  in  the  appendixes  ©f  Klein  apd 
Tintner.  Allen  devotes  two  chapters  (12  and  13)  to  all  the  vector,  matrl^ 
and  determinant  theory  an  economist  is  ever  likely  to  need. 

The  estimating  criteria  of  unbiascdness,  consistency,  etc.,  are  clearly  stilted 
and  briefly  discussed  in  the  first  dozen  pages  of  Kendall's  second  volume, 
and  debunked  by  Hogben  in  the  reference  cited  in  the  text. 

The  reason  for  using  m^/Smtt  as  an  estimate  of  o-(f  ,f ),  the  formula  for 
estimating  cov  (&,i)  and  cr((2,d),  and  the  extensions  of  these  formula!  for 
several  y  variables  are  stated  and  rationalized  (in  my  opinion,  not  too  con- 
vincingly) by  Klein,  pp.  133-137. 


CHAPTER  3 


Bias  in  models  of  decay 


3.1.  Introduction  and  summary 

This  chapter  is  tedious  and  not  crucial;  it  can  be  skipped  without 
great  loss.  I  wrote  it  for  two  reasons:  to  develop  the  concept  of 
conjugate  samples,  and  to  show  what  I  have  claimed  in  the  Preface: 
that  common-sense  interpretations  of  intricate  theorems  in  mathe- 
matical statistics  can  be  found. 

The  main  proposition  of  this  chapter  is  that  a  single-equation  model 
of  the  form 

Vi  -  V(t)  -  «  +  7i*i(0  +•■••  +  VHZH(Jt)  +  u(t)  (3-1) 

in  which  some  of  the  z's  are  not  exogenous  variables  but  rather  lagged 
values  of  y  itself,  necessarily  violates  Simplifying  Assumption  6,  and 
hence  that  maximum  likelihood  estimates  of  a,  71,  ...  ,  7#  are 
biased. 

The  concept  of  conjugate  samples  gives  a  handy  and  simple-minded 
but  entirely  rigorous  way  to  test  for  bias.  It  will  be  used  again  and 
again  in  later  chapters  for  models  much  more  complicated  than  (3-1). 

Equations  involving  lags  of  an  endogenous  variable  are  called 
autoregressive. 
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Most  satisfactory  dynamic  econometric  models  are  multivariate 
autoregressive  systems,  in  other  words,  elaborate  versions  of  (3-1), 
and  share  its  pitfalls  in  estimation.  We  shall  see  that  the  character  of 
initial  conditions  affects  vitally  our  estimating  procedure  and  that, 
unfortunately,  in  econometrics  the  initial  conditions  are  not  favorable 
to  estimation,  though  in  the  experimental  sciences  they  commonly  are. 

If  the  initial  condition  y(0)  is  a  fixed  constant  Y,  the  maximum  likeli- 
hood criterion  leads  to  least  squares  regression  of  y(t)  on  y(t  —  1), 
and  the  resulting  estimate  for  7  is  biased,  except  for  samples  of  size  1. 

If  y(0)  is  a  random  variable,  independent  of  u,  then  the  maximum 
likelihood  criterion  does  not  lead  to  least  squares.  If  least  squares  are 
used  in  this  instance,  they  lead  to  biased  estimates,  again  with  the 
exception  of  samples  of  size  1. 

CONVENTION 

The  size  S  of  the  sample  is  given  in  units  that  correspond  to  the 
number  of  points  through  which  a  line  is  fitted.  Thus,  if  we  observe 
only  2/3  and  2/2,  this  is  a  sample  of  one;  S  =  1.  If  we  observe 
2/4, 2/3,  and  2/2,  this  makes  a  sample  of  two  points,  S  «  2,  and  so  on.  In 
his  proof  of  this  theorem,  Hurwicz  (in  Koopmans,  chap.  15)  would  call 
these,  respectively,  samples  of  size  T  =  2  and  T  =  3.  The  difference 
is  important  when  observations  have  gaps  (are  not  consecutive).  We 
shall  confine  ourselves  to  consecutive  samples.  Appendix  D  deals 
with  the  general  case. 

3.2.  Violation  of  Simplifying  Assumption  6 

A  lagged  variable,  unlike  an  exogenous  variable,  cannot  be  inde- 
pendent of  the  random  component  of  the  model.  In  (3-1)  a  lagged 
value  of  y  is  necessarily  correlated  with  some  past  value  of  u,  because 
2/(0  and  u(t)  are  clearly  correlated.  Therefore,  the  very  specification 
of  (3-1)  rules  out  Simplifying  Assumption  6. 

But  why  worry  about  such  models?  Because  (3-1)  and  its  generali- 
zations express  in  linear  form  oscillations,  decay,  and  explosions,  which 
are  all  of  great  interest  and  which  are,  indeed,  the  bread  and  butter  of 
physics,  astronomy,  and  economics.  For  instance,  springs  behave 
substantially  like 

2/(0  =  a  +  yiy(t  -  1)  +  y2y(t  -  2)  +u(t) 
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and  radioactive  decay  and  pendulums  like 

y(t)  -  yy(t  -  1)  +  «(0  (3-2) 

Business  cycles  are  more  complicated,  involving  several  equations 
like  (3-1). 

Why  do  we  want  unbiased  estimates?  There  are  excellent  reasons. 
If  the  world  responds  to  our  actions  with  some  delay  or  if  we  respond 
with  delay  to  the  world,  in  order  to  act  correctly  we  need  to  know  the 
parameters  accurately.  How  hot  the  water  in  the  shower  is  now 
depends  on  how  far  I  had  turned  the  tap  some  seconds  ago.  If  my 
estimate  of  the  parameter  expressing  the  response  of  water  temperature 
to  a  turn  of  the  tap  is  biased,  this  means  that  I  freeze  or  get  scalded 
or  that  I  alternate  between  these  two  states,  and,  in  any  event,  that 
I  reach  a  comfortable  temperature  much  later  than  I  would  with  an 
unbiased  estimate. 

In  economics,  consumers,  businesses,  and  governments  act  like  a 
man  in  a  shower.  The  information  they  get  about  prices,  sales, 
orders,  or  national  income  comes  with  some  delay  and  reflects  the 
water  temperature  at  the  tap  some  time  ago.  Moreover,  it  takes  time 
to  decide  and  to  put  decisions  into  effect.  If  the  decision  makers 
have  misjudged  how  strong  are  the  natural  damping  properties  of  the 
economy,  decisions  and  policy  will  either  overshoot  or  undershoot  the 
mark,  or  alternate  between  overshooting  and  undershooting  it,  and 
will  cause  uncomfortable  and  unnecessary  oscillations  in  economic 
activity. 

Our  discussion  will  now  be  confined  to  the  simplest  possible  case 
(3-2).  Let  consumption  this  year  y(t)  depend  on  consumption  last 
year  y(t  —  1),  as  in  (3-2).  If  the  relationship  involved  a  constant 
term  a,  we  eliminate  a  by  measuring  y  not  from  the  origin  but  from  its 
equilibrium  value.  I  shall  illustrate  my  argument  by  a  concrete 
example  where  the  true  y  has  the  convenient  value  0.5  and  where  the 
initial  value  Y  is  fixed  and  equal  to  24. 

In  Fig.  6,  line  OP  represents  the  exact  relationship  yt  =  0.5yt-i. 

3,3.  Conjugate  samples 

In  model  (3-1)  witli  fixed  initial  conditions,  We  can  describe  a  sample 
completely  by  mentioning  two  things:  (1)  what  time  periods  it  includes 
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and  (2)  what  values  the  disturbances  took  on  in  those  periods.    For 
example,  (a,b,c,d)  in  Fig.  6  is  completely  described  by 

[  8  -  1,    2,    3,    4] 
[u.  =  4,    0,    0,    OJ 

(a' tbf  ,c' ,df)  is  described  by 

[s=      1,    2,    3,    4] 
[u.  =  -4,    0,    0,    OJ 

and  (a',&',d',e)  by 


s  -      1,    2,    4,    5 
w.  =  -4,    0,    0,    0 


If  w  is  symmetrically  distributed,  all  conceivable  samples  of  size  S 
that  one  can  draw  from  the  universe  can  be  arranged  in  conjugate  sets. 
We  shall  see  that  in  each  conjugate  set  the  maximum  likelihood  estimates 
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Fig.  6.  Conjugate  disturbances. 

of  7  average  to  less  than  the  true  value  of  y  and,  therefore,  that 
maximum  likelihood  estimates  are  biased  for  all  samples  of  size  S. 

These  propositions  need  to  be  qualified  if  S  =  1  or  if  7  is  not  between 
0  and  1;  they  are  proved  if  u(t)  is  normally  distributed,  but  only 
conjectured  if  u(t)  has  some  other  symmetrical  distribution. 

For  an  introduction  to  the  concept  of  conjugate  samples,  consider 
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Fig.  6,  which  depicts  two  of  the  many  possible  courses  that  events  can 
follow  under  our  assumptions  that  y  —  0.5  and  Y  =  24.  One  course 
is  represented  by  the  points  a,  6,  c,  d,  e,  .  .  .  ;  the  other  by  a',  &',  c', 
d',  .  .  .  .  In  the  first  course,  the  disturbance  is  equal  to  +4  in  period  1 
and  zero  thereafter.  In  the  second  course,  it  is  —4  in  period  1  and  zero 
thereafter.  The  samples  S(+)  =  (a,b,c,d)  and  S(  — )  =  (af ,b' ,c' ,d') 
are  conjugate  samples,  and  form  a  conjugate  set.  Similarly,  (a,b,c)  and 
(a',6',c')  form  a  conjugate  set. 

To  be  conjugate,  two  samples  must  be  drawn  from  the  same  time 
span  s  =  1,  2,  .  .  .  ,  S;  and  the  disturbances  ua  that  contributed  to 
corresponding  observations  must  have  the  same  absolute  value  in  the 
two  samples.  This  definition  is  for  consecutive  samples  only.  Appen- 
dix D  extends  it  to  the  nonconsecutive  case. 

Thus,  sample 

«*-U:J!]. 

forms  a  conjugate  set  all  by  itself. 


has  as  its  conjugate 


Sample 


s  m  3,    4,      5,    6 
ut  =  0,    0,     17,    0 


[  •  -  3,    4,  5,    6] 

[u,  =  0,    0,     -17,    OJ 


f  •  -  4,    5,  6,        71 

[u,  -  0,     1,  0,     -9J 

has  three  conjugates,  the  following: 

[  «  =  4,        5,    6,        7]        f.f-4,    5,    6,    7] 
[u.  =  0,     -1,    0,     -9 J        [w,«0,    1,    0,    9 J 


s  =  4,        5,    6,    7 
u,  =  0,     -1,    0,    9 
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The  greatest  conjugate  set  of  samples  of  size  S  is  2*,  where  k  (0  < 
k  <  S)  represents  the  number  of  nonzero  disturbances.  If  S  »  4, 
the  largest  conjugate  set  contains  16  samples. 


3.4.  Source  of  bias 

In  Fig.  6,  line  OR  with  slope  ^[S(+)]  =  0.6053  is  the  least  squares 
regression  through  the  origin,  and  sample  S(+)  =  (a,&,c,e£);  and  OR' 
with  slope  t[S(— )]  =  0.3545  is  the  same  for  the  conjugate  (a',b',c\d'). 
The  line  OR  overestimates  y  because  OR  is  pulled  up  by  point  a.  The 
line  OR'  underestimates  y  because  of  the  downward  pull  of  point  a'. 
As  we  have  lA  (0.6053  +  0.3545)  =  0.4799  <  y,  the  downward  pull 
is  the  stronger.  But  why?  Because  point  a  is  accompanied  by  6,  c,  d, 
and  a!  by  b'f  c',  d'.  The  primed  points  b't  c'f  d'  are  closer  to  the  origin 
than  the  corresponding  unprimed  points;  hence,  their  "leverage"  on 
their  least  squares  line  OR'  is  weaker  than  the  leverago  of  the  unprimed 
points  on  theirs  (line  OR).  It  is  impossible  for  a'  to  be  accompanied 
by  b,  c,  d,  because  all  future  periods  must  necessarily  inherit  whatever 
impulse  was  first  imparted  by  the  random  term  of  period  1.  Points  &', 
c;,  d'  inherit  a  negative  impulse,  and  points  6,  c,  d  inherit  a  positive 
one. 

Another  way  of  stating  this  is  by  referring  to  (3-1).  In  (3-1)  one 
of  the  z's  (say,  z4)  is  a  lagged  value  of  y  (say,  the  lag  is  2  time  periods). 
It  follows  that  Zi(t)  is  correlated  with  the  past  value  of  the  disturbance 
u(t  —  2),  since  y(t)  is  clearly  correlated  with  u(t). 

All  the  proofs  of  bias  later  in  this  chapter  and  in  Appendix  D  are 
merely  fancy  versions  of  what  I  have  just  shown  for  this  special  case. 
When  conjugate  sets  are  large,  arguments  from  the  geometry  of  Fig.  6, 
though  perfectly  possible,  become  confusing,  and  so  we  turn  to 
algebra. 

With  fixed  initial  condition  2/(0)  =  F,  the  maximum  likelihood  esti- 
mate of  7  is  the  least  squares  estimate 


^  V'V-i 


1  -  ^ (3-3) 
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3.5.  Extent  of  the  bias 


From  (3-3)  and  (3-2), 


1  -  7  +  -^ (3-4) 


We  write  the  above  fraction  iV//>.  We  shall  see  that  the  bias  N/D 
varies  with  the  true  value  of  7,  the  size  of  the  sample,  and  the  size 
of  the  initial  value  F.  For  instance,  in  small  samples  it  is  almost 
25  per  cent;  in  samples  of  20  observations,  it  is  about  10  per  cent  of 
the  true  value  of  7.  It  never  disappears,  no  matter  what  value  true  7 
may  have  or  how  large  a  sample  one  takes. 

By  applying  (3-2)  repeatedly  and  letting  P,  Q,  and  R  stand  for 
polynomials,  we  get 


2V  = 

(wi  +  7W2  4-  y2uz  -f 

•  • 

•  + 

yB-hisW  +  P(uu  .  .  . 

,u8) 

D  = 

(1  +  72  4-  74  +  •  •  • 

+ 

^2(S- 

-«)F«  +  YQ(y,uh  .  .  . 

,us~ 
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+  R(uh 

.   .   . 

,1*8- 

-l) 

By  considering  N/D,  one  can  establish  that  the  bias  is  aggravated  the 
further  7  is  from  4-1  or  —  1  and  the  smaller  the  sample.  Bias  exists 
even  when  7  =  ±1  or  when  7  =  0;  the  latter  is  truly  remarkable, 
since  the  model  is  then  reduced  to  y(t)  —  u(t).  Since  2V  is  a  linear 
function  of  Y  and  the  always  positive  denominator  D  is  a  quadratic 
function  of  F,  the  bias  N/D  can  be  quite  large  for  certain  ranges  of  F. 
The  above  results  generalize  to  model  (3-1),  although  it  is  not  easy 
to  say  whether  the  bias  is  up  or  down. 

3.6.  The  nature  of  initial  conditions 

The  following  fantastically  artificial  example  illustrates  the  concept 
of  conjugate  samples  and  what  it  means  for  initial  conditions  2/(0)  to 
be  random  or  fixed. 

An  outfit  that  runs  automatic  cafeterias  has  its  customers  use, 
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instead  of  coins,  special  tokens  made  of  copper.  The  company  has 
several  cafeterias  across  the  country,  but  its  customers  rarely  think 
of  taking  their  leftover  tokens  with  them  when  they  travel  OP  move 
from  city  to  city.  As  there  is  at  most  one  cafeteria  per  city,  each 
cafeteria's  tokens  are  like  independent,  closed  monetary  systems. 
Let  us  look  at  a  single  cafeteria  of  this  kind. 

Originally  it  had  coined  a  number  of  brand-new  tokens  and  put 
them  in  circulation,  using  y(0)  pounds  of  copper.  Thereafter,  the 
amount  of  copper  in  the  tokens  is  subject  to  two  influences.  (1)  To 
begin  with,  the  tokens  wear  out  as  they  are  used.  The  velocity  of 
token  circulation  is  equal  in  all  cities,  and  customers'  pockets,  hands, 
keys,  and  other  objects  that  rub  against  the  tokens  are  equally  abra- 
sive in  all  cities.  Thus,  in  each  city,  year  t  inherits  only  a  part  7 
(0  <  7  <  1)  of  the  coppor  circulating  in  the  previous  year*  (2)  In 
addition  to  the  systematic  factor  of  wear  and  tear,  random  influences 
are  at  play.  First,  some  customer's  child  now  and  then  swallows  a 
token;  this  disappears  utterly  from  circulation  into  the  city's  lowers. 
However,  occasionally  there  is  an  opposite  tendency.  An  amateur 
but  successful  counterfeiter  mints  his  own  token  now  and  then,  or  a 
lost  token  is  found  inside  a  fish  and  put  back  into  circulation.  So 
the  copper  remaining  in  circulation  is  described  by  tha  Itochastic 
model  (3-2).  The  problem  for  the  company  is  how  to  estimate  the 
true  survival  rate  of  its  tokens. 

It  is  very  important  to  interpret  correctly  our  first  assumption  that 
"u(t)  is  a  random  variable  in  each  time  period  t."  It  means  that  u(t) 
is  capable  of  assuming  at  least  two  values  (opposites,  if  u  m  sym- 
metrical) in  the  same  period  of  time.  But  how  can  it?  Here  we 
need  a  concept  of  conjugate  cities  analogous  to  conjugate  samples. 
Imagine  that  the  only  positive  disturbances  come  from  one  counter- 
feiter and  that  the  only  negative  disturbances  come  from  one  child, 
the  counterfeiter's  child,  who  swallows  tokens.  The  counterfeiter  is 
divorced,  the  child  was  awarded  to  the  mother,  and  the  two  parents 
always  live  in  separate  cities,  say,  Ames  and  Buffalo;  but  who  lives 
where  in  year  t  is  decided  at  random.  Ames  and  Buffalo  are  conjugate 
cities,  because,  when  one  experiences  counterfeiting,  +u(t)t  the  other 
necessarily  experiences  swallowing,  —  u(t).  If  there  were  more  families 
like  this  one,  the  set  of  conjugate  cities  would  have  to  expand  enough 
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to  accommodate  all  permutations  of  the  various  values  that  ±u(t) 
is  capable  of  assuming. 

We  have  fixed  initial  conditions  if  each  cafeteria  starts  with  the 
same  poundage,  and  random  initial  conditions  when  the  initial  pound- 
age is  a  random  variable.  To  estimate  the  token  survival  rate, 
different  procedures  should  be  used  in  the  two  cases. 

3.7.  Unbiased  estimation 

Unbiased  estimation  of  7  is  possible  only  if  the  initial  copper  endow- 
ment is  a  fixed  constant  Y.  The  only  unbiased  estimate  is  given  by 
the  ratio  of  the  first  two  successive  ?/'s  using  data  from  a  single  city: 

~  _  2/(1)  _  2/(1)  ,0  KN 

7  "  W)  "  ~Y-  (3"5) 

which  is  a  degenerate  least  squares  estimate. 

This  result  is  really  startling.  It  says  that  we  must  throw  out  any 
information  we  may  have  about  copper  supply  anywhere,  except  in 
year  0  and  year  1  in,  say,  Ames.  Unless  we  do  this  we  can  never 
hope  to  get  an  unbiased  estimate.  Estimating  7  without  bias  when 
each  city  starts  with  a  different  amount  of  copper  is  an  impossible 
task.  A  complete  census  of  copper  in  all  cities  (i)  in  two  successive 
years  would  give  the  correct  (not  just  unbiased)  estimate 


-  T     I     -!»_    =  y 


2  2/.<0 

y   m    i 

I  y<«  -  i)         '  I  *»  ~  1) 

i  i 

We  can  draw  another  fascinating  conclusion:  If  we  have  the  bad 
luck  to  start  off  with  different  endowments,  we  can  never  get  an 
unbiased  estimate  of  7.  But  suppose  we  find  that  the  endowments 
of  all  cities  happen  to  be  equal  later,  say,  in  period  t  —  1.  Then  all 
we  have  to  do  is  wait  for  the  next  year,  measure  the  copper  of  any 
one  city,  say,  Buffalo,  and  compute  the  ratio 

?  -  ^-  (3-6) 

Vt-i 
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which  is  an  unbiased  estimate.  (Where  the  information  would  mine 
from,  that  all  cities  have  an  equal  token  supply  in  y©ar  i  —  1,  is 
another  matter.) 

The  experimental  scientist  is,  however,  free  from  such  predicaments. 
If  he  thinks  radium  decays  as  in  (3-2),  then  he  can  make  initial  con- 
ditions equal  by  putting  aside  in  several  boxes  equal  lumps  of  radium. 
Then  he  can  let  them  decay  for  a  year,  remeasure  them,  apply  (3-6) 
to  the  contents  of  any  one  box,  and  average  the  results.  Any  One  box 
gives  an  unbiased  estimate.  Averaging  the  contents  of  several  boxes 
gives  an  estimate  that  is  efficient  as  well  as  unbiased. 

The  econometrician  cannot  control  his  initial  conditions  in  this  way. 
If  he  wants  an  unbiased  estimate,  he  must  throw  away,  as  prescribed, 
most  of  his  information,  use  formula  (3-6),  and  thus  get  an  unbiased 
and  inefficient  estimate.  Or  else  he  may  decide  that  he  wants  to 
reduce  the  variance  of  the  estimate  at  the  cost  of  introducing  some 
bias;  then  he  will  use  a  formula  like  that  of  Exercise  3.C  below  or 
some  more  complicated  version  of  it. 

Autoregressive  equations  are  related  to  the  moving  average,  a  tech- 
nique commonly  employed  to  interpolate  data,  to  estimate  trends, 
and  to  isolate  cyclical  components  of  time  series.  The  statistical 
pitfalls  of  estimating  (3-2)  plague  time  series  analysis,  and  they  are 
not  the  only  pitfalls.  The  last  chapter  of  this  book  returns  to  some 
of  these  problems. 

Exercises 

3.A    Prove  that  (3-5)  is  unbiased. 
3.B     Prove  that  y  =  y(2)/y(l)  is  biased. 
3.C    Prove  that  y  =  [y{2)  +  y(l)]/[y(l)  +  Y]  is  biased, 
3.1)    Let  ut  in  (3-2)  have  the  symmetrical  distribution  q(uf)  with 
finite  variance.     "Symmetrical"  means  q(ut)  =  q(  —  ut).    Then  the 

likelihood  function  of  a  random   consecutive  sample  is    {f  q(ut). 

Prove  that  the  maximum  likelihood  estimate  of  y  is  obtained  by 
minimizing  the  expression  Y  log  g(^,),  where  the  ft,  are  the  vertical 

a 

deviations  from  the  line  that  we  are  seeking. 
3.E    By  the  method  of  conjugate  samples  or  by  any  other  method, 
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prove  or  disprove  the  conjecture  that  the  estimate  of  Exercise  3.D  is 
biased. 


Further  readings 

The  reader  who  wants  to  see  for  himself  how  intricate  is  the  statistical 
theory  of  even  the  simplest  possible  lagged  model  (3-2)  may  look  up  "Least- 
Squares  Bias  in  Time  Series,"  by  Leonid  Hurwicz,  chap.  15  of  Koopmans, 
pp.  3G5-383.  Tintner,  pp.  255-260,  gives  examples  and  shows  additional 
complications. 


CHAPTER  4 


Pitfalls  of  simultaneous 
interdependence 


4.1.  Simultaneous  interdependence 

"Everything  depends  on  everything  else"  is  the  theme  song  of  the 
Economic  and  the  Celestial  Spheres.  It  means  that  several  con- 
temporaneous endogenous  variables  hang  from  one  another  by  means 
of  several  distinct  causal  strings.  Thus,  there  are  two  causal  (more 
politely,  functional)  relations  between  aggregate  consumption  and 
aggregate  income :  Since  people  are  one  another's  customers,  consump- 
tion causes  income,  and,  since  people  work  to  eat,  income  causes 
consumption.  The  two  relationships  are,  respectively,  the  national 
income  identity  in  its  simplest  form 

Vt  =  ct  (4-1) 

and  the  (unlagged)  stochastic  consumption  function  in  its  simplest 
form 

ct  -  a  +  pyt  +  ut  (4-2) 

Wo  can  imagine  that  causal  forces  flow  from  the  right  to  the  loft  of 
the  two  equality  signs. 
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The  moral  of  this  chapter  is  that,  if  endogenous  variables,  like  c 
and  y,  are  connected  in  several  ways,  like  (4-1)  and  (4-2),  every 
statistical  procedure  that  ignores  even  one  of  the  ways  is  bound  to  be 
wrong.  The  statistical  procedure  must  reflect  the  economic  inter- 
dependence. 


4.2.  Exogenous  variables 

I  shall  not  vouch  for  the  Heavens,  but  in  economics  there  are  such 
things  as  exogenous  variables.  A  variable  exogenous  to  the  economic 
sphere  is  a  variable,  like  an  earthquake,  that  influences  some  economic 
variables,  like  rents  and  food  prices,  without  being  influenced  back. 
The  random  term  u  is,  ideally,  exogenous — though  in  practice  it  is  a 
catchall  for  all  unknown  or  unspecified  influences,  exogenous  or  endoge- 
nous. One  thing  is  certain:  Earthquakes  and  such  are  not  influenced 
by  disturbances  in  consumption.  Indeed,  the  definition  of  an  exoge- 
nous variable  is  that  it  has  no  connection  with  the  random  component 
of  an  economic  relationship. 

My  prototype  exogenous  variable,  investment  z,  is  not  really 
exogenous  to  the  economic  system,  especially  in  the  long  run,  but  we 
shall  bow  to  tradition  and  convenience  for  the  sake  of  exposition. 

4.3.  Haavelmo's  proposition 

The  models  in  this  chapter,  like  the  single-equation  models  treated 
m  far,  (1)  are  linear  and  (2)  have  all  the  Simplifying  Properties. 
Therefore,  they  are  subject  to  all  the  pitfalls  I  have  pointed  out  so 
far.  Unlike  the  models  of  Chaps.  1  to  3,  the  new  models  each  contain 
at  least  two  equations.  Most  of  my  examples  will  have  precisely 
two  (and  not  three  or  four)  for  convenience  only,  since  the  results 
can  easily  be  extended. 

New  kinds  of  complication  arise  when  a  second  equation  is  added. 

1.  The  identification  problem 

It  is  sometimes  impossible  to  estimate  the  parameters — this  problem 
is  side-stepped  until  Chap.  6. 
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2.  The  Haavelmo1  problem 

The  intuitively  obvious  way  of  estimating  the  parameters  of  a 
two-equation  model  is  wrong,  even  in  the  simplest  of  cases,  where  one 
of  the  equations  is  an  identity.  We  shall  see  that  pedestrian  methods 
are  unable  to  estimate  correctly  the  marginal  propensity  to  consume 
out  of  current  income,  no  matter  how  many  years  of  income  and 
consumption  data  we  may  have.  Even  infinite  samples  overestimate 
the  marginal  propensity  to  consume.  This  difficulty  is  as  strategic 
as  it  sounds  incredible.  It  means  that  the  multiplier  gets  overesti- 
mated and,  hence,  that  counterdepression  policies  will  undershoot  full 
employment  and  counterinflation  policies  will  be  too  timid.  Because 
of  bad  statistical  procedures,  the  cure  of  unemployment  or  inflation 
comes  too  slowly. 

The  model  is  as  follows: 

ct  =  a  ■+■  $yt  -j-  ut        (consumption  function)     (4-2) 
ct  +  zt  —  Vt  (income  identity)  (4-3) 

where  zt  (investment)  is  exogenous,  and  ut  has  all  the  Simplifying 
Properties.2  We  shall  illustrate  by  assuming  the  convenient  values 
a  =  5,  0  =  0.5. 

In  Fig.  7,  line  FG  represents  the  true  relation  ct  —  5  +  0.5yt.  When 
the  random  disturbance  is  positive,  the  line  moves  up;  with  negative 
disturbance,  it  moves  down.  Lines  HJ  and  KL  correspond,  respec- 
tively, to  random  errors  equal  to  +2  and  —2.  OQf  the  45°  line 
through  the  origin,  represents  equation  (4-3)  for  the  special  case  in 
which  investment  z  is  zero.  In  the  years  when  investment  is  zero, 
the  only  combinations  of  income  and  consumption  we  could  possibly 
observe  will  have  to  lie  on  OQ,  because  nowhere  else  can  there  be 
equilibrium.     If,  for  instance,  in  years  1900  and  1917  investment  had 

1  For  reference  to  Haavelmo,  see  Further  Readings  at  the  end  of  this  chapter. 

2  To  be  specific,  Assumption  6  in  this  case  requires  that  u  and  z  shall  not  influence 
each  other,  either  in  the  same  time  period  or  with  a  lag.  But  the  random  term  u 
cannot  be  independent  of  y.  The  reason  is  that  a  and  0  are  constants,  z  is  fixed 
outside  the  economic  sphere,  and  u  comes,  so  to  speak,  from  a  table  of  random 
numbers;  if  this  is  so,  then,  by  equations  (4-2)  and  (4-3),  a,  /?,  z,  ana  u  necessarily 
determine  y  (and  c).  Thus  variable  y  is  not  predetermined  but  codetermined 
with  c.    These  statements  summarize  and  anticipate  the  remainder  of  the  chapter. 
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been  zero  and  if  the  errors  had  been  +2  and  —2,  respectively,  then 
points  P  and  P'  would  have  been  observed. 

Let  us  now  suppose  that  in  some  years  investment  zt  equals  3. 
Line  MN  (also  45°  steep)  describes  the  situation,  which  is  that 
ct  +  3  »  yt.  With  errors  ut  =  ±2,  the  observable  points  are  at  R 
and  R'.  With  errors  ranging  from  —2  to  +2,  all  observable  points 
fall  between  R  and  R'. 


Fig.  7.  The  Haavelmo  bias. 


Let  us  now  pass  a  least-squares-regression  line  through  a  scatter 
diagram  of  income  and  consumption,  minimizing  squares  in  the  vertical 
sense  and  arguing  that  from  the  point  of  view  of  the  consumption  function 
Income  causes  consumption,  not  vice  versa.  Such  a  procedure  is 
bound  to  overestimate  the  slope  p  of  the  consumption  function  and 
to  underestimate  its  intercept  a.     This  is  Haavelmo' s  proposition. 

The  least  squares  line  (in  dashes)  corresponds  to  observation  points 
that  lie  in  the  area  PP'R'R.  It  is  tilted  counterclockwise  relative  to 
the  true  line FG  because  of  the  pull  of  "extreme"  points  in  the  corners 
next  to  R  and  P'.  The  less  investment  z  ranges  and  the  bigger  the 
stochastic  errors  u  are,  the  stronger  is  the  counterclockwise  pull, 
because  lines  PP'  and  RR'  fall  closer  together. 

This  overestimating  of  /3  persists  even  if  we  allow  investment  to 
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range  very  far*  Though  it  is  true  that  the  parallelogram  PP'R'R  gets 
longer  and  longer  toward  the  northeast  (say,  it  becomes  PW'P')  the 
fact  remains  that  V  and  P',  the  extreme  corners,  help  to  tilt  the  least 
squares  line  upward.  This  suggests  that  perhaps  we  ought  to  mioi- 
mize  squares  not  in  a  vertical  direction  but  in  a  direction  running  from 
southwest  to  northeast.  In  this  particular  case  (though  not  generally) 
diagonal  least  squares  are  precisely  correct  and  equivalent  to  the 
procedure  of  simultaneous  estimation  described  in  the  following  section. 

4.4.  Simultaneous  estimation 

We  know  that  two  relations,  not  one,  account  for  the  slanted  posi- 
tion of  the  universe  points  in  Fig.  7.  Had  the  consumption  function 
been  at  work  alone,  a  given  income  change  Ay  would  result  in  a  change 
in  consumption  Ac  =  0  Ay.  Had  the  income  identity  been  at  work 
alone,  then  to  the  same  change  in  income  would  correspond  a  larger 
change  in  consumption  Ac  =  Ay.  In  fact,  both  relations  are  at  work. 
Therefore,  the  total  manifest  response  of  consumption  to  income  is 
neither  Ac  =  0  Ay  nor  Ac  =  Ay}  but  something  in  between.  This  is 
why  the  line  in  dashes  is  steeper  than  FG  (and  less  steep  than  OQ). 

In  order  to  isolate  the  0.  effect  from  a  sample  of  points  like  PP'R'R, 
both  relations  must  be  allowed  for.  This  is  done  by  rewriting  the 
model : 

01  I  P  I  Ut  /A     M\ 

v*  -  r=Tf  +  T=^Zt  +  n^  <"> 

The  term  ut/(l  —  0)  has  the  same  properties  as  ut  except  that  it 
has  a  different  variance.  Therefore  the  error  term  in  the  new  model 
has  all  the  Simplifying  Properties.  Either  of  the  new  equations  con- 
stitutes a  single-equation  model  with  one  endogenous  variable  (c  and  y, 
respectively)  and  one  independent  variable  (z  in  both  cases).  There- 
fore, the  estimating  techniques  of  Sec.  2.5  can  be  applied  to  the  sophisti- 
cated parameters  a'  =  «/(l  -  0),  71  =  0/(1  -  0).  72  =  1/(1  -  0). 
Denote  these  estimates  by  the  hat  (A).  For  the  naive  least  squares 
estimate  of  a  and  0,  derived  from  regressing  c  on  y,  use  the  bird  (v). 
Let  us  now  express  these  estimates  in  terms  of  moments,  and  let  us  do 
it  for  0,  71,  and  72  only,  leaving  aside  a  and  a'. 
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a     _      ■  ft        _.  Mc. 
1  -  j§         ™*« 

1  -  0      m„ 


|  Is  a  biased  estimate  of  ft,  because 
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and  it  is  known  that 


/3  is  inconsistent,  because 


e  5bs  ^  o  (4-6) 


a  W(«-h^+«)(>+a-f«)  ^  ft  +  (1  +  j3)(mu</m„)  4-  Tnuu/mgg 
^d+a+u)(i+a+u)  1  +  2mut/mzt  +  muu/mtt 

The  various  moments  mcv,  muu,  etc.,  vary  in  value,  of  course,  from 
sample  to  sample.  As  the  sample  size  approaches  the  population 
size,  however,  mul  approaches  cov  (u,z)  =  0,  muu  approaches  var  u  >  0, 
and  m„  approaches  var  z  >  0.    Therefore, 

Plim  ^  =  fl  +  vartt/vars 
1  +  var  w/var  z 

Exercises 

4. A    In  similar  fashion  prove  that  Plim  a  <  a. 
4.B    Interpret  (4-6). 

4.C  Show  that  1/(1  —  /§)  is  a  biased  estimate  of  1/(1  —  ft). 
Hint:  manipulate  the  expression 

1  1 


1  —  $       l  —  mcy/mvv 

and  use  the  fact  that  e(mv,/m„)  =  1/(1  -  ft). 

4.D    Prove  that  fi  is  an  unbiased  and  consistent  estimate  of 
0/(1  -  0). 
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4.E    Prove  that  ^2  is  an  unbiased  and  consistent  estimate  of 

1/(1  -  fl. 

4.F  Prove  that  1 1  and  f2  yield  a  single  compatible  estimate  of  0, 
which  we  call  0;  $  ^met/myy. 

4.G    Prove  that  $  is  a  biased  but  consistent  estimate  of  #. 

4.H  From  the  facts  that  /§  «  0  -f  mu,/rnyv  and  that  #  =  $  +  muy/mnt 
argue  that  the  bias  of  $  is  less  serious  than  the  bias  of  #. 

Digression  on  directional  least  squares 

What  do  we  get  if,  in  Fig.  7,  we  minimize  the  sum  of  th©  square 
deviations  not  vertically  but  from  the  southwest  to  the  n§?thP&st? 
Let  P(y,c)  in  Fig.  8  stand  for  any  point  of  the  sample;  p  ®  tZ  is 
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Fig.  8.  Directional  least  squares. 


parallel  to  the  45°  line.  Let  0  be  the  angle  of  inclination  of  the 
true  consumption  function;  that  is,  let  tan  6  be  the  slope  of  the 
curve  a  +  Py  =  0.  Then  in  triangle  PZM,  from  the  law  of  sines, 
we  have 


u 


sin  <f> 
from  which  it  follows  that 


sin  (90  +  6) 


Vt 


V2  V2   , 


l-^ 


1-/3 


to) 
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Then, 

2j  P'  "  (1  -  fl)l  2/  M'  "  (1  Z  0)2  m(o-«-/»V)(o-«HJv) 

Setting  c  =  */  —  *, 

J  Pi  =  (0  j  1)2  l«W.  +  tf  -  !)2^  +  208  -  1)«J 
Minimizing  Spj  with  respect  to  0  —  1,  we  obtain 


1      _  myt 


1  —  0      mM 
that  is  to  say,  the  same  expression  that  we  found  for  *? 2. 

4.5.  Generalization  of  the  results 

Section  4.3  showed  the  pitfalls  of  ignoring  the  income  identity  in 
estimating  the  consumption  function,  and  Sec.  4.4  showed  how  to  get 
around  this  difficulty  by  the  technique  of  simultaneous  estimation, 
which  takes  into  account  the  entire  model  even  though  the  investigator 
may  be  interested  in  only  a  part.  Chapters  5  to  9  deal  with  the 
intricacies  of  simultaneous  estimation  and  various  approximations 
thereof. 

To  prepare  the  way,  let  us  enlarge  the  model  slightly,  by  making 
Investment  respond  to  income.    The  new  model  is 

ct  =  a  +  0y,  +  ut  (4-2) 

it  "-  Zi  +  7  +  tyt  +  Vt  (4-7) 

ct  +  it  =  Vt  (4-8) 

where  zt  is  autonomous  investment;  it  is  total  investment;  and  ut}  vt  are 
random  disturbances  independent  of  each  other  and  of  present  and 
past  values  of  z.  The  last  sentence  is  a  statement  of  Simplifying 
Assumption  7,  which  will  be  explained  and  justified  in  the  next  chapter. 
Letting  st  —  yt  —  ct  stand  for  saving,  we  obtain  from  (4-2)  the 
saving  function 

8t  -  -a  +  (1  -  fi)yt  -  n  (4-9) 

Figure  9  shows  saving  SS  and  investment  77  as  functions  of  income, 
with  zero  disturbances  (thick  lines)  and  disturbed  by  ±uf  ±v,  respec- 
tively (thin  lines),  and  with  the  usual  (stable)  relative  slopes. 
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Naive  least  squares  applied  to  Fig.  9  underestimate  the  slop©  I  —  0 
of  SS  (as  it  underestimates  the  slope  of  OQ  in  Fig.  7)  and,  hence,  again 
overestimates  the  marginal  propensity  to  consume. 


The  Haavelmo  bias  again. 


4.6.  Bias  in  the  secular  consumption  function 

We  have  shown  that  naive  curve  fitting  overestimates  the  slope  of 
the  consumption  function,  even  with  large  samples  and  whether  or  not 
investment  is  a  function  of  income.  Statistical  fits  of  the  secular 
consumption  function  give  a  slope  varying  from  over  0.95  to  nearly 
1.0,  contradicting  the  lower  figures  given  by  budget  studies,  introspec- 
tion, and  Keynes's  hunch.  To  reconcile  these  facts,  consumption 
theories  of  imitation,  irreversible  behavior,  and  more  and  more 
explanatory  variables  have  been  invoked.  A  large  part  of  what  these 
ingenious  theories  account  for  can  be  explained  by  Haavelmo's 
proposition. 

Further  readings 

Trygve  Haavelmo's  proposition  was,  apparently,  stated  first  in  "The 
Statistical  Implications  of  a  System  of  Simultaneous  Equations"  (Econo- 
melrica,  vol.  11,  no.  1,  pp.  1-12,  January,  1943),  but  a  later  article  of  his 
applying  the  proposition  to  the  consumption  function  has  attracted  far  more 
attention.  This  has  appeared  in  three  places:  Trygve  Haavelmo,  "Methods 
of  Measuring  the  Marginal  Propensity  to  Consume"  (Journal  of  the  American 
Statistical  Association,  vol.  42,  no.  237,  pp.  105-122,  March,  1947);  reprinted 
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as  Cowles  Commission  Paper  22,  new  series;  and  again  as  chap.  4  of  Hood, 
pp.  75-91.  Haavelmo  gives  numerical  results  and  confidence  intervals  for 
the  parameter  estimates. 

Jean  Bronfenbrenner,  "Sources  and  Size  of  Least-squares  Bias  in  a  Two- 
equation  Model,"  chap.  9  of  Hood,  pp.  221-235,.  extends  Haavelmo's  propo- 
sition to  three  more  special  cases.  An  early  article  by  Lawrence  R.  Klein, 
"A  Post-mortem  on  Transition  Predictions  of  National  Product"  {Journal  of 
Political  Economy,  vol.  54,  no.  4,  pp.  289-308,  August,  1940),  puts  the 
Haavelmo  proposition  in  proper  perspective,  as  indicating  only  one  of  the 
many  sources  of  malestimation. 

Milton  Friedman,  A  Theory  of  the  Consumption  Function  (New. York: 
National  Bureau  of  Economic  Research,  1957),  also  compares  and  discusses 
rival  measurements  of  consumption,  but  his  main  concern  is  to  test  the 
Permanent  Income  hypothesis  and  to  refine  the  consumption  functions,  not  to 
discuss  econometric  pitfalls.  It  contains  valuable  references  to  the  literature 
of  the  consumption  function. 

According  to  Guy  H.  Orcutt,  "  Measurement  of  Price  Elasticities  in  Inter- 
national Trade"  {Review  of  Economics  and  Statistics,  vol.  32,  no.  2,  pp.  117-132, 
May,  1950),  Haavelmo's  proposition  explains  why  exchange  devaluation  had 
been  underrated  as  a  cure  to  balancc-of-payments  difficulties.  Orcutt  con- 
fines mathematics  to  appendixes  and  gives  many  further  references. 

Tjalling  C.  Koopmans  in  "Statistical  Estimation  of  Simultaneous  Economic 
Relations"  {Journal  of  the  American  Statistical  Association,  vol.  40,  no.  232, 
pt.  1,  pp.  448-4G6,  December,  1945),  discusses  the  Haavelmo  proposition  with 
the  help  of  a  supply-and-dcmand  example  and  with  interesting  historical 
comments.  When  the  random  disturbances  are  viewed  not  as  errors  of 
observation  clinging  to  specific  variables  but  as  errors  of  the  econometric 
relationship  itself,  then  they  affect  all  simultaneous  endogenous  variables 
symmetrically,  and  Haavelmo's  problem  rears  its  head.  The  Koopmans 
article  is  a  good  preview  of  the  next  chapter. 


CHAPTER  S 

Many- equation  linear  models 


5.1.  Outline  of  the  chapter 

The  moral  of  Chap.  4  is  this:  if  a  model  has  two  equations  they 
cannot  be  estimated  one  at  a  time,  each  without  regard  for  the  other, 
because  both  take  part  together  in  generating  the  phenomena  from 
which  we  draw  samples.  This  fact  rules  out,  except  in  special  cases, 
the  use  of  the  pedestrinr  technique  of  naive  least  squares.  Both  the 
moral  and  the  reasom  ehind  it  remain  in  force  as  the  number  of 
equations  in  the  model  1.  ^ 

The  present  chapter  is  rathe  nimportant,  and  might  be  skipped  or 
skimmed  at  first.     All  its  principles  are  implicit  in  Chap.  4. 

The  main  task  of  Chap.  5  is  to  systematize  the  study  of  many- 
equation  linear  models.  First  we  present  some  standard  and  effort- 
saving  notation  (Sec.  5.2).  Next,  we  review  the  Simplifying  Assump- 
tions, which  were  originally  introduced  for  one-equation  models  in 
Chap.  1,  to  see  precisely  how  they  extend  to  the  general  case  (Sec.  5.3). 
With  two  or  more  equations,  a  seventh  Simplifying  Assumption  is 
required,  that  of  stochastic  independence  among  the  equations  (Sec.  5.4). 

The  presence  of  several  simultaneous  equations  in  a  model  compli- 
cates the  likelihood  function  with  the  term  det  J,  which  we  have 

73 


74  MANY-EQUATION   LINEAR  MODELS 

ignored  until  now;  in  intricate  fashion  det  J  involves  the  parameters  of 
all  equations  in  the  system.  (The  last  proposition  merely  restates  the 
moral  of  Chap.  4.)  The  digression  on  Jacobians  explains  what  det 
J  is  doing  in  the  likelihood  function. 

If  we  heed  the  moral  to  the  letter  and  take  det  J  into  account,  we 
get  into  awfully  long  computations  (see  Sec.  5.5)  in  spite  of  all  our 
original  Simplifying  Assumptions. 

Whether  computations  are  long  or  short,  it  pays  to  lay  them  out  in 
an  orderly  way.  This  is  a  general  precept,  of  course,  but  its  value 
stands  out  most  dramatically  in  the  present  chapter.  It  pays  not  only 
to  do  computations  in  an  orderly  manner  but  also  to  perform  some 
redundant  ones  just  in  case  you  might  want  to  check  some  alternative. 
Econometricians  normally  settle  down  to  a  specific  model  only 
after  much  experimentation.  And,  further,  redundant  computations 
become  necessary  when  we  want  to  estimate  a  given  promising  model 
by  increasingly  refined  techniques.  The  wisdom  of  performing  the 
redundant  computations  will  become  fully  apparent  only  after  we  have 
dealt  with  ovcridentified  systems,  instrumental  variables,  limited 
information,  and  Theirs  method  (Chaps.  6  to  9). 

5.2.  Effort-saving  notation 

It  pays  to  establish  once  and  for  all  a  uniform  notation  for  complete 
linear  models  of  several  equations.  These  are  conventions,  not 
assumptions. 

The  endogenous  variables  are  denoted  by  y'a.  There  are  G  endoge- 
nous variables,  called  y\y  ytt  ■»••■$  Vo  an(^>  collectively,  y.  y  is  the 
vector  (2/1,2/2,  .  •  .  ,2/c).  We  use  g  (g  «  1,  2,  .  .  .  ,  G)  as  running 
subscript  for  endogenous  variables. 

The  exogenous  variables  are  denoted  by  z'q.  There  are  H  exoge- 
nous variables,  called  Zi,  z%%  .  .  .  %  znf  and  z  is  their  vector.  These 
may  be  lagged  values  of  the  y'a  only  by  special  mention.  The  running 
subscript  of  an  exogenous  variablo  is  h  «■  1,  2,  .  .  .  ,  H. 

All  definitions  have  been  solved  out  of  the  system,  so  that  there  are 
exactly  G  equations,  all  stochastic,  with  errors  U\,  u%,  .  .  .  ,  uq. 
u  =  (uifu2t- .  .  .  ,uG).     We  speak  of  the  gth  equation. 

The  coefficients  of  the  y'a  are  called  /3s,  and  those  of  the  z'a  are  called 


5.2.   EFFORT-SAVING  NOTATION 


75 


7s.  They  bear  two  subscripts:  the  first  refers  to  the  equation,  the 
second  to  the  variable  to  which  the  parameter  corresponds. 

We  get  rid  of  the  constant  term  (if  any)  by  letting  the  last  exogenous 
variable  zh  be  identically  equal  to  1 ;  its  parameter  y0H  then  becomes 
the  constant  term.  In  most  applications  we  shall  not  bother  to  write 
the  constant  term  at  all.  Either  it  is  in  the  last  term  y0HZH  =  1,  or  it 
has  been  eliminated  by  measuring  all  variables  from  their  means. 

B  and  T  represent  the  matrices  of  coefficients  in  their  natural  order: 


B  - 


■]8ii 

021 


012 

022 


-001       002 


010" 
020 


^GQ~ 


r  = 


7n      7i2 

721        722 


-7oi     702 


71//"1 

72/f 


7o#. 


A  stands  for 


B  is  always  square  and  of  size  G  XG;T  is  of  sizeCr  X  H. 
the  elements  of  B  and  r  set  side  by  side: 

A  =  [Br] 

that  is  to  say,  for  the  matrix  of  all  coefficients  in  the  model,  whether 
they  belong  to  endogenous  or  exogenous  variables.  A  is  of  size 
GX(G  +  H). 

x  stands  for  the  elements  of  y  and  z  set  side  by  side. 


x  =  (2/1,2/2, 


,2/o  J  Zi,Z2, 


>zh) 


that  is  to  say,  x  is  the  vector  of  all  variables,  whether  endogenous  or 
exogenous,  but  in  their  natural  order. 

ai  stands  for  the  first  row  of  A,  on  for  the  second  row,  etc. ;  similarly, 
for  gi,  5a,  ...  ,  5o,  71,72,  •  .  •  ,  Yo-  That  is,  a  lower-case  bold  Greek 
letter  with  a  single  subscript  g  represents  (some  of)  the  parameters  of  a 
single  equation  (the  oth)  of  the  system. 

We  reduce  the  number  of  parameters  to  bo  estimated  by  dividing 
each  equation  by  one  of  its  coefficients.  This  does  not  affect  the 
model  in  any  other  way.  We  use  the  pth  coefficient  of  the  oth  equation 
for  this,  so  that  0W  =  L  Henceforth  we  shall  always  take  matrix  5 
in  its  ''standardized  form" 
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B  = 


'  1      0u 

021  1 


01O~ 
020 


-001       0(72       '    '    '  1   - 

A  model  can  be  written  in  a  variety  of  forms: 
1.  Explicitly,  as  below  (time  subscripts  omitted) : 


yi+  Pay  %  + 

02i2/i  +  y%  + 


+  01(72/(7  +  711*1  +  71222  + 
+  0202/0  +  72l2l  +  722^2  + 


4-  yiHZn  =  ui 
+  ImZn  -  u2 


0(712/1  +  0(722/2  +••••+  Jto  +  701«1  +  7(72«2  + 


2.  In  extended  vector  form) 


?iy  +  yiz  =  wi 

?2y  +  Y2Z  =  uz 


5oy  4-  yoz  ■  wo 
3.  In  condensed  vector  form: 

ait  =  Wi 

at*  =  Wa 


+  7off*#  ■  Wo 
(5-D 


(5-2) 


(5-3) 


€10%   —  Uq 

4.  In  extended  matrix  form: 

By  +  Tz  =  u 

5.  In  condensed  matrix  form: 

Ax  =  u 


(5-4) 
(5-5) 
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Note  that,  when  the  context  is  clear,  bold  lower-case  letters  stand 
either  for  a  row  or  for  a  column  vector.  S 

Finally,  <rgh(t)  stands  for  the  co variance  of  ug(t)  with  uh(t) ;  #&(£)  is 
the  matrix  of  these  co  variances;  and  6°h(t)  is  its  inverse. 

5.3.  The  Six  Simplifying  Assumptions  generalized 

A  laconic  mathematician  can  generalize  the  Six  Simplifying  Assump- 
tions with  a  stroke  of  brevity  by  saying  that  they  continue  to  apply  if 
we  replace  the  symbol  u  by  u.  Our  task  is  to  interpret  this  in  terms  of 
economics. 

In  Chap.  1,  I  discussed  the  Six  Simplifying  Properties  when  there 

was  a  single  equation  in  the  model  and,  therefore,  a  single  disturbance. 

Now  we  have  one  disturbance  for  each  equation,  and  u  is  the  vector 

made  up  of  them,  u(t)  =  (wi(0,W2(0>  •  •  •  »wo(0)» 

i 
Assumption  1 

"u  is  a  random  variable"  means  that  each  u0(t)  is  a  random  variable, 
that  is  to  say,  that  all  equations  remaining  after  solving  out  the 

definitions  are  stochastic. 

1 

Assumption  2 

"u  has  expected  value  0"  means  that  the  mean  of  the  joint  distribu- 
tion is  the  vector  0  =  (0,0,  .  .  .  ,0),  or  that  each  u0  has  zero  expected 
value. 

Assumption  3 
"u  has  constant  variance"  means  that  the  covariances 
(T0h  =  cov  (u0}uh) 
of  the  several  disturbances  do  not  vary  with  time. 

Assumption  4 

"u  is  normal"  means  that  Wi(0,  u2(t),  .  .  .  ,  uQ(t)  are  jointly 
normally  distributed. 

Assumption  5 

"u  is  not  autocorrelated"  means  that  there  is  no  correlation  between 
the  disturbance  of  one  equation  and  previous  values  of  itself. 
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Assumption  0 

"u  is  not  correlated  with  z"  means  that  no  exogenous  variable — in 
whichever  equation  it  appears — -is  correlated  with  any  disturbance, 
past,  present,  or  future,  of  any  equation  in  the  model. 

On  these  assumptions,  the  likelihood  function  of  the  sample  is 

s 
L  -  (2»)-"»(det  J)s(det  fa*])"*'2  exp  {  -  %  J  u,[<^]u.}     (5-6) 

9—1 

which  should  be  compared  with  (2-2).    The  analogy  is  perfect.    The 
expression  in  the  curly  braces  can  also  be  written 


«     g     h 

Another  way  to  write  the  likelihood  function  is 


^JJ«B(5K«AW  (5-7) 


L  m  (2ir)-s>*(det  J)s(det  M)"5'2 exp  {  -\i  £  Ax(«)[d'*]x(a)A}     (5-8) 

« 

which  brings  out  the  fact  that  L  is  a  function  (1)  of  ail  the  unknown 
parameters  @0h,  y0h,  <rvhy  <T0h)  and  (2)  of  all  the  observations  \(s)  (s  =  1, 2, 
.  .  .  ,  S).    The  function's  logarithmic  form 

S-1  log  L  «  -  Y2  log  2t  +  log  det  J  -  Y2  log  det  [4gh] 

8 

-Vl^I  Ax(s)[d'*]x(s)A     (5-9) 
is  easier  to  use. 

5,4.  Stochastic  independence 

The  seventh  Simplifying  Assumption:  6gti  is  a  diagonal  matrix,  or 

cov  (u0)uh)  =  0        for  g  j*  h 

is  not  obligatory,  but  it  is  easy  to  rationalize.     It  states  that  the 
disturbance  of  one  equation  is  not  correlated  with  the  disturbance  in  any 
other  equation  of  the  model  in  the  same  time  period — something  quite 
different  from  Assumption  6. 
Recall  that  each  random  term  is  a  gathering  of  errors  of  measure- 
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ment,  errors  of  aggregation,  omitted  variables,  omitted  equation!,  and 
errors  of  linear  approximation.  Assumption  7  states  that  either 
(1)  the  #th  equation  and  the  /ith  equation  are  disturbed  by  different 
random  causes,  or  (2)  if  they  are  disturbed  by  the  same  causes,  dif- 
ferent "drawings"  go  into  ug(t)  and  uh(t).  This  assumption  is  dearly 
inapplicable  in  the  following  situations: 

1.  In  year  t,  all  or  nearly  all  statistics  were  subject  to  larger  than  the 
usual  errors,  because  of  a  cut  in  the  budget  of  the  Statistics  Bureau. 

2.  Errors  of  aggregation  affect  mainly  national  income  (because  of 
shifts  in  distribution),  and  national  income  enters  several  equations  of 
the  model. 

3.  Omitted  variables  (one  or  more)  are  known  to  affect  two  (qf  more) 
equations.  For  instance,  weather  affects  the  supply  of  watermelons, 
cotton,  and  whale  blubber.  Now  if  the  model  contains  equations  for 
watermelons  and  blubber,  the  inclusion  of  weather  in  the  random  term 
does  not  hurt,  because  relatively  independent  drawings  of  waather 
(one  in  the  Southeast,  one  in  the  South  Pacific)  affect  these  two 
industries.  However,  if  watermelons  and  cotton  are  included  in  the 
model,  both  of  these  are  grown  in  the  same  belt,  the  weather  affecting 
them  is  one  and  the  same,  and  Assumption  7  is  violated. 

Assumption  7  simplifies  the  computations  (1)  because  it  leaves  fewer 
covariances  to  estimate,  (2)  because  det  6oh  becomes  a  simple  product 
Uffggj  and  (3)  because  all  the  cross  terms1  in  (5-7)  drop  out.  This  can 
reduce  computations  by  a  factor  of  2  or  3  for  a  model  of  as  few  as  three 
equations  and  by  a  much  greater  factor  for  larger  systems. 

Digression  on  Jacobians 

The  likelihood  function  involves  a  term  expressed  m  det  J»  the 
Jacobian  of  the  functions  u,  say,  with  respect  to  the  variables  y; 
we  have  disregarded  det  J  until  now,  since  we  have  taktn  it  on 
faith  to  be  equal  to  1.  This  is  no  longer  true  in  a  many-equation 
model.  Here  J  is  a  matrix  of  unknown  parameters,  the  same  /Js, 
in  fact,  that  we  are  trying  to  estimate  with  the  likelihood  function. 

The  main  ideas  behind  J  are  three: 

1.  If  you  know  the  probability  distribution  of  a  variable  u  (or 
several  variables  uh  w2,  .  .  .  ,  Uq),  then  you  can  find  the  proba* 
1  Those  for  which  g  ^  h. 
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bility  distribution  of  a  variable  y  related  to  u  functionally  (or  of 
several  y's  related  functionally  to  the  w's). 

2.  If  the  w's  and  t/'s  are  equally  numerous  and  if  the  functions 
connecting  the  two  sets  are  one-to-one,  continuous,  and  with  con- 
tinuous first  derivatives,  then  the  matrix  J  of  all  partial  deriva- 
tives of  the  form  du/dy  will  have  an  inverse. 

3.  If  conditions  I  and  2  arc  satisfied,  then  wo  can  calculate  the 
joint  probability  distribution  q  of  the  y'a  from  the  known  joint 
probability  distribution  p  of  the  w's  (omitting  the  subscript  t)  as 
follows: 


(5-10) 


p(ui,u2,  .  .  •  ,uG)  du\  du2  '  '  •  dug 

=  det  J  •  p(uhu2,  .  .  .  ,u0)  dyx  dy2  *  •  •  dy0 
or    q(yi,y2,  .  .  .  ,:!/o)  dyi  dy2  •  •  •  dya 

=  det  J  •  p(uhu2i  .  .  .  ,u0)  dyi  dy2  •  •  •  dyG 

I  shall  illustrate  these  three  ideas  by  examples. 

Example  1 

Let  u  be  a  single  variable  whose  probability  distribution  we 
know  to  be  as  follows: 


Value  of  u 

Probability  p(u) 

-4 

0.1 

-3 

0.2 

1 

0.4 

3 

0.3 

Let  y  be  related  functionally  to  u  as  follows: 

y(u)  -  w2  -  4w  +  3  (5-11) 

As  u  takes  on  its  four  values,  y  takes  on  the  corresponding 
values  ?/(-4)  =  35,  t/(-3)  =  24,  y{\)  =  0,  r/(3)  =  0.  Since  we 
know  how  often  u  is  equal  to  —4,  —3,  1,  and  3,  we  can  find  how 


oiten  y  is  equal  to  oo,  zi,  ana 

u. 

Value  of  y 

35 

24 

0 

Probability  ^(?/) 
0.1 
0.2 
0.7 

5.4.  STOCHASTIC  INDEPENDENCE  $1 

Example  2 

The  same  can  be  done  with  several  y's  and  u's  conneetid  by  m 
appropriate  set  of  functions,  for  instance, 

2/i  =  -t*i  -|-  Zu\  -  Ui 
2/2  =  e~ui  +  log  u2 

provided  the  probability  distribution  p(wi,Wj,Wa)  is  known. 

Relation  (5-11)  is  not  one-to-one,  since,  for  every  value  of  yt  U 
can  have  two  values.  Accordingly,  in  Example  1  the  second 
condition  is  violated,  and  the  Jacobian  is  undefined.  The  same  is 
true  for  Example  2. 

Whenever  the  functional  relation  between  the  w's  and  y's  is 
one-to-one,  [du/dy]  and  [dy/du]  are  single-valued  and  their 
determinants  multiply  up  to  the  number  1. 


Example  3 


y(u)  =  3w  -f  log  u  —  4 


Though  it  is  very  hard  to  express  w  in  terms  of  y,  we  know 
that,  since  dy/du  =  3  +  \/u  =  (3w  +  l)/w,  the  Jacobian 
j  =  du/dy  -  w/(3w  +  1). 


Example  4 


2/i  «  —  Ui  +  w2 

2/2  -  e~ui  +  log  u2  +  5 


(5*12) 


Here  we  can  compute  det  J  from  knowledge  of 


det 


raj- -i 

L^WJ  W2 


+  e-tt» 


since  it  follows  that  det  J  =  u2/(u2e-"x  —  1).    Therefore,  by 
(5-10),  the  probability  distribution  of  the  y's  is 


9(2/1,2/2)  d?/i  efo/2  = 


^2 


w2e"WI  - 1 


p(uhu2)  dyx  dy* 


Now,  what  relevance  does  all  this  have  to  econometrics? 
Very  simple.  Let  2/1,  ytt  •  •  •  >  Vq  De  endogenous  variables,  and 
let  uh  u2,  .  .  .  ,  uQ  be  the  random  errors  attached  to  the  struc* 
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tural  equations.  The  model's  G  equations  are  explicit  functional 
relations  between  the  y's  and  the  u'a,  like  (5-12).  Directly,  we 
know  nothing  at  all  about  the  probability  of  this  or  that  combina- 
tion of  2/'s.  Nevertheless,  (5-10)  allows  us  to  compute  this  proba- 
bility, namely,  in  terms  of  J  and  the  probability  distribution 
of  the  u's.  It  turns  out  that  the  right-hand  side  of  (5-10)  involves 
only  the  parameters  we  seek,  the  observations  we  can  make,  and 
the  probability  distribution  p,  which  we  have  already  specified 
when  we  constructed  the  model. 

If  the  structural  equations  are  all  linear,  as  in  (5-1),  the 
matrix  J  of  all  partial  derivatives  of  the  form  du/dy  turns  out 
to  be  nothing  but  the  matrix  B  itself. 

1         #12        *    *    *       01G 

021  1  *    '    *       020 


J- 


-&?i       002       *    '    '  1    - 


-  B 


5.5.  Interdependence  of  the  estimates 

Now  that  we  know  that  J  =  B,  we  can  both  find  the  values  /3,  7,  <r 
that  maximize  the  likelihood  function  (5-9)  and  compute  its  actual 
value.    Actually  we  do  not  care  how  large  L  itself  is. 

Naturally,  maximizing  such  a  function  by  ordinary  methods  is  a 
staggering  job;  we  won't  undertake  it.  In  fact,  nobody  undertakes  it 
by  direct  attack.  We  shall  use  (5-9)  to  answer  the  following  question: 
In  order  to  estimate  this  particular  parameter  or  this  particular 
equation,  do  we  need  to  estimate  all  parameters?  The  answer, 
generally,  is  yes. 

Note  first  of  all  that  the  maximum  likelihood  method  of  estimating 
B,  r,  and  60h  differs  from  the  naive  least  squares  method  quite  radically, 
because  the  least  squares  method  does  not  involve  the  term  log  det  B 
at  all.  In  other  words,  the  least  squares  method,  if  applied  to  the 
model  one  equation  at  a  time,  omits  from  account  the  matrix  B;  it 
does  not  allow  the  parameters  of  one  equation  to  influence  the  estima- 
tion of  the  parameters  of  another;  nor  does  it  allow  the  covariances  a0h 
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to  influence  in  the  least  the  parameter  estimates  of  any  equation  that 
is  being  fitted. 

Finally,  the  least  squares  technique  estimates  the  covariances  &g9 
one  at  a  time  without  involving  any  other  covariance.  Contrariwise, 
in  maximum  likelihood,  the  estimates  $  of  one  equation  affect  the  $s 
and  1  s  of  another;  the  &s  of  one  equation  affect  the  /3s  and  f  s  of  another; 
and  one  &  affects  another. 

In  a  word,  the  sophisticated  maximum  likelihood  method  is  very 
expensive  from  the  point  of  view  of  computations  and  is  probably 
more  refined  than  the  quality  of  the  raw  statistical  data  warrants. 
Econometric  theory  is  like  an  exquisitely  balanced  French  recipe, 
spelling  out  precisely  with  how  many  turns  to  mix  the  sauce,  how 
many  carats  of  spice  to  add,  and  for  how  many  milliseconds  to  bake 
the  mixture  at  exactly  474  degrees  of  temperature.  But  when  the 
statistical  cook  turns  to  raw  materials,  he  finds  that  hearts  of  cactus 
fruit  are  unavailable,  so  he  substitutes  chunks  of  cantaloupe;  where  the 
recipe  calls  for  vermicelli  he  uses  shredded  wheat;  and  he  substitutes 
green  garment  dye  for  curry,  ping-pong  balls  for  turtle's  eggs,  and,  for 
Chalifougnac  vintage  1883,  a  can  of  turpentine. 

Two  courses  of  action  are  open  to  the  econometrician  who  is  reluc- 
tant to  lavish  refined  computations  on  crude  data: 

1.  Use  the  refined  maximum  likelihood  method,  but  reduce  the 
burden  of  computation  by  making  additional  Simplifying  Assumptions. 

2.  Water  down  the  maximum  likelihood  method  to  something  more 
pedestrian  but  not  quite  so  naive  as  least  squares.  Limited  informa- 
tion}  instrumental  variable,  and  other  techniques  are  available;  they 
are  the  subject  of  Chaps.  7,  8,  and  9. 

5.6.  Recursive  models 

If  B  is  a  triangular  matrix,1  the  model  is  called  recursive;  and  its 
computation  is  lightened,  because  there  are  fewer  0s  to  estimate  and 
because  det  B  =  1. 

The  economic  interpretation  of  a  recursive  model  is  the  following. 
There  is  an  economic  variable  in  the  system  (say,  the  price  of  coffee 
beans)  that  is  affected  only  by  exogenous  variables  (like  Brazilian 
weather) ;  next,  there  is  a  second  economic  variable  (say,  the  price  of  a 

1  B  is  triangular  if  pgh  =  0  for  all  g  <  h. 
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cup  of  coffee)  that  is  affected  by  exogenous  variables  (tax  on  coffee 
beans)  and  by  the  one  endogenous  variable  (price  of  coffee  beans)  just 
mentioned.  Next,  there  is  a  third  economic  variable  (say,  the  number 
of  hours  spent  by  employees  for  coffee  breaks)  that  depends  only!  on 
exogenous  variables  (the  amount  of  incoming  gossip)  and  (one  or  both 
of)  the  first  two  endogenous  variables  but  no  others;  and  so  on. 

Exercises 

5.A    In  the  recursive  system 

2/i  =  7*i  +  u 

V2  =  Pyi  -f  yiZi  +  y2z2  +  v 

let  the  Simplifying  Properties  hold  for  u,  v  with  respect  to  the  exoge- 
nous variables.  Prove  that,  if  $  is  estimated  by  naive  least  squares, 
that  is,  if 

S  sg  ^(vi.gi.«i)(vi.gi.«i) 

W(Vlt«|.*l)(l/l.«l.*l) 

then  |5  is  biased. 
5.B    In  the  recursive  model 

xt  =  Pyt  +  ut 
yt  =  yxt-\  4-  vt 

show  that  j5  and  ^  are  unbiased  but  that  least  squares  applied  to  the 
autoregressive  equation  obtained  as  a  combination  of  the  two  equations 
gives  biased  estimates. 

Further  readings 

The  notation  of  Sec.  5.2  is  worth  learning  because  it  is  becoming  standard 
among  econometricians.     It  is  expanded  in  Koopmans,  chap.  2. 

Jacobians  are  illustrated  by  Klein,  pp.  32-38.  The  mathematics  of 
Jacobians,  with  proofs,  can  be  found  in  Richard  Courant,  Differential  and 
Integral  Calculus,  vol.  2,  chap.  3  (New  York:  1953),  or  in  Wilfred  Kaplan, 
Advanced  Calculus,  pp.  00-100  (Reading,  Massachusetts:  1952). 

Klein,  p.  81,  gives  a  simple  example  of  a  recursive  model. 


CHAPTER  6 

Identification 


6.1.  Introduction 

Identification  problems  spring  up  almost  everywhere  in  econometrics 
as  soon  as  one  departs  from  single-equation  models.  This  chapter  far 
from  exhausts  the  subject.  In  particular,  the  next  two  topics, 
instrumental  variables  in  Chap.  7  and  limited  information  in  Chap.  8, 
are  intimately  bound  up  with  it.  The  identification  problem  will  arise 
sporadically  in  later  chapters. 

Though  this  chapter  is  self-contained,  some  familiarity  with  the 
subject  is  desirable.  I  know  of  no  better  elementary  treatment  than 
that  of  Tj ailing  C.  Koopmans,  "Identification  Problems  in  Economic 
Model  Construction,"  chap.  2  in  Hood.  I  have  chosen  to  devote  this 
chapter  to  a  few  topics  which,  in  my  opinion,  either  have  not  received 
convincing  treatment  or  have  not  been  put  in  pedagogic  form. 

The  main  results  of  this  chapter  are  the  following: 

1.  There  are  several  definitions  of  identifiability.  I  show  their 
equivalence. 

2.  Lack  or  presence  of  identification  may  be  due  (a)  to  the  model's 
a  priori  specification,  (b)  to  the  actual  values  of  its  unknown  parame- 
ters, or  (c)  to  the  particular  sample  we  happen  to  have  drawn. 
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3.  There  are  ways  to  detect  overidentificatton  and  underidentifica- 
tion.  These  ways  are  not  always  foolproof.  There  are  several  ways 
to  remove  over-  or  underidentification. 

4.  In  spite  of  the  superficial  fact  that  they  are  defined  in  analogous 
terms,  underidentification  and  overidentificatioh  are  qualitatively  dif- 
ferent properties:  the  former  is  nonstochastic,  the  latter  stochastic; 
the  former  can  be  removed  (in  special  cases)  by  means  of  additional 
restrictions,  the  latter  is  handled  by  better  observations  or  longer 
computation. 

6,2,  Completeness  and  nonsingularity 

The  following  discussion  applies  to  all  kinds  of  models,  linear  or  not, 
largo  or  small,  but  it  will  bo  illustrated  by  this  example: 

2/i  +  YnZi  4  71222  +  71323  4  71424  =  ui 

0212/1  +  2/2  4  0232/3  4-  721Z1  4  72222  4  72323  *  Ui     (6-1) 

0312/1  4  2/3  4  73121  4  73222  =   W8 

This  model  describes  an  economic  mechanism  that  works  somewhat 
like  this: 

1.  The  parameters  /?  and  7  are  fixed  constants. 

2.  In  each  time  period,  someone  supplies  outside  information  about 
the  exogenous  variables  z. 

3.  In  each  time  period,  someone  goes  to  a  preassigned  table  of 
random  numbers,  and,  using  a  prescribed  procedure,  reads  off  some 
numbers  uh  w2,  us. 

4.  All  this  is  fed  into  (6-1). 

5.  Values  for  the  endogenous  variables,  2/1,  2/2,  2/s,  are  generated  in 
accordance  with  the  resulting  system. 

The  last  step  succeeds  if  and  only  if  the  linear  equations  resulting  from 
step  4  are  independent.  Otherwise  there  is  an  infinity  of  compatible 
triplets  (2/1,2/2,2/3).  The  model  is  complete  if  it  can  be  solved  uniquely 
for  (2/1,2/2,2/3);  otherwise  it  is  incomplete.  To  generate  a  unique 
triplet  it  is  necessary  and  sufficient  that  the  matrix  B  be  nonsingular, 
meaning  that  no  row  of  it  is  a  linear  combination  of  other  rows. 

The   economic   interpretation   of   singularity   and   nonsingularity 
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is  very  simple.  Each  equation  in  (6-1)  represents  the  behavior 
of  a  sector  of  the  economy,  say,  producers,  consumers,  bankers, 
buyers,  sellers,  or  middlemen.  These  sectors  respond  to  exogenous 
stimuli  z  and  economic  stimuli  y.  They  may  respond  to  exogenous 
stimuli  in  any  way  whatsoever.  In  particular,  it  is  quite  permis- 
sible for  them  to  respond  in  the  same  way  to  all  exogenous  stimuli 
(711  =  721  =  T3i,  712  =  722  =  732,  etc.).  But,  if  the  matrix  is  to  be 
nonsingular,  they  should  respond  in  different  ways  to  the  endogenous 
stimuli.  No  sector  may  have  the  same  0  parameters  as  another;  no 
sector's  responses  may  be  the  average  of  two  other  sectors'  responses. 
No  sector  may  be  a  weighted  average  of  any  other  sectors,  as  far  aa 
economic  stimuli  are  concerned. 

To  illustrate  singularity,  consider  a  simple  economy  which  consists 
of  throe  families  responding  to  three  economic  stimuli  but  such  that  the 
third  family  makes  an  average  response.  Then  B  is  singular,  and  the 
model  containing  the  three  families  is  incomplete.  For  nonsingularity 
the  sectors  must  be  sufficiently  unlike  each  other.  In  fact  this  is  the 
definition  of  sectors:  that  they  are  economically  different  from  one 
another. 

Exercise 

6.  A  Prove  the  following  theorems  by  using  the  common  sense  of 
the  five  steps  of  the  above  discussion:  "If  Assumption  7  is  made, 
then  B  is  nonsingular,"  and  "If  B  is  singular,  Assumption  7  cannot 
hold."  These  two  statements  can  be  reworded:  "An  econometric 
model  is  complete  if  and  only  if  its  sectors  are  stochastically  inde*  II 
pendent."  Appendix  E  proves  this  mathematically,  but  what  is 
wanted  in  this  exercise  is  an  "economic"  proof. 

6.3.  The  reduced  form 

Every  complete  linear  model  By  -f  Tz  =  u  can  be  reduced  to 
y  =  IIz  +  v.  These  two  expressions  are  called  the  original  form  and 
the  reduced  form.  If  it  is  complete,  the  original  model  (6-1)  can  be 
reduced  to 

2/1    =   7TnZi  -f  7Ti2Z2  +  ITi&i  +  TuZa  +  V\ 

2/2   =   7T2lZl  +  *"22Z2  +  Vr&Z  +  ^24^4  +  V%  (6-2) 

2/3   =   7T3lZl  +  A-32Z2  +  Tz&i  +  ^34^4  +  #3 
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Some  obvious  properties  of  (6-2)  are  worth  pointing  out:  Its  random 
disturbances  Vi,  v*f  *>3  are  linear  combinations  of  the  original  random 
disturbances'and  share  their'properties.  However,  the  v's  have  different 
covariances  from  the  it's.  In  particular,  the  v's  are  interdependent 
even  if  the  u's  were  stochastically  independent.  (We  seldom  have  to 
worry  about  the  precise  relation  among  the  w's  and  v's.)  Unlike  the 
typical  original  form,  each  equation  of  the  reduced  form  contains  all 
the  exogenous  variables  of  the  model. 

Each  equation  of  the  reduced  form  constitutes  a  model  that  satisfies 
the  Six  Simplifying  Assumptions  of  Chap,  1  and,  therefore,  may  validly 
be  estimated  by  least  squares;  these  estimates  are  called  its.  If  it  is 
possible  to  work  back  from  the  its  to  estimate  unambiguously  the 
coefficients  /3,  7  of  the  original  form,  we  shall  call  such  estimates  /3,  -f 
and  say  that  (6-1)  is  exactly  identified.  Finally,  the  coefficients  of  the 
two  forms  (6-1),  (6-2)  are  connected  as  follows: 

—  7ll  =  7TH  —721  =  ^21^11  +  7T21  +  0237T31  731  =  /?3l7Tn  +  7T3i 
~7l2  =  1T12  —722  —  0217T12  +  7T22  +  023^32  732  =  Pn^l*  4*  ^32 
~7l3  =  ^13  —723  :=   ^21^13  +  T23  +  ^23^33  0  =  /33lTi3  4"  ^33 

—  714  =  ^14  0  :=   /?2l7Tl4  +  7T24  +  0237T34  0  =  ^ZlTTu  +  ""34 

(6-3) 

It  is  possible,  but  messy,  to  solve  for  im,  •-  •  •  ,  xt4  in  terms  of  the 
0s  and  7s.     The  important  fact  is  that,  in  general,  all  its  are  a  priori 
nonzero  in  the  reduced  form,  even  if  many  of  the  j8s  and  7s  are  a  priori 
zero  in  the  original  form. 
Relations  (6-3)  can  be  written  much  more  compactly: 

-r  =  Bn  (6-4) 

6.4.  Over-  and  underdeterminacy 

As  a  preview  for  the  rest  of  this  chapter,  imagine  that  (6-1)  is 
complete.  If  so,  its  reduced  form  (6-2)  exists  and  can  be  estimated  by 
least  squares.    Let  the  estimates  be  tfn,  .  .  .  ,  #34. 

Now  consider  the  leftmost  column  of  equations  in  (6-3).  Evidently 
the  7s  can  be  computed  right  away  from  the  7rs,  uniquely  and  unam- 
biguously.    We  say,  then,  that  711,  712,  713,  7i4  are  exactly  identified. 

Consider  next  the  last  two  equations  of  (6-3) ;  they  give  rise  to  two 
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estimates  of  03i,  namely,  —#33/^13  and  ~*fr u/fut  which  in  general  are 
quite  different,  no  matter  how  ideal  the  sample.  When  this  happt m 
to  parameters,  we  say  that  they  are  (or  the  equation  that  g@ii&ifi§ 
them  is)  overidentified;  accordingly,  system  (6-3)  over  deter  inirm  #31. 

Consider  now  the  middle  column  of  (G-3).  Its  four  equations 
underdetermine  the  five  unknowns  02i,  1823,  721,  722,  723.  An  equation 
to  which  such  parameters  belong  is  underidentified. 

Obviously,  then,  the  identification  problem  has  something  to  do 
with  the  number  of  equations  and  unknowns  in  the  system  —  F  m  Bn. 
The  Counting  Rules  of  Sec.  6.7  will  show  this  more  precisely, 

6.5.  Bogus  structural  equations 

Consider  the  supply-demand  model 

SS  (Supply)  yi  +  pl2y2  -  ux 

DD  (Demand)       02i2/i  +  2/2  =  u2  "**' 

where  y\  represents  price  and  y2  represents  quantity;  linear  combina- 
tions of  the  true  supply  and  demand  are  called  bogus  relations  and  are 
branded  with  the  superscript  © .  A  bogus  relation  may  parade  either 
as  supply  or  as  demand, 

SS®  =  j(SS)  +  k(DD)        DD®  =  m(SS)  +  n(DD) 

where  j,  k,  m,  n  are  unknown  numbers,  but  suitable  to  make  the 
standardized  coefficients  pflt  pf2  of  the  bogus  relations  equal  to  1. 

The  bogus  coefficients  are  connected  with  the  true  coefficients  as 
follows: 

Pfi  -  J  +  W21  =  1        /3?2  -  jfiit  +  k 

0®  -  m  -f-  nfai  pf2  =  m/3i2  +  n  =  1 

The  bogus  supply  contains  a  random  term 

u?  -  jui  +  ku2 

and  the  bogus  demand  contains  an  analogous  term 

uf  =  mu\  +  nu2 
Later  on  we  shall  use  the  following  relations  between  the  eovariances   ! 
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of  the  bogus  and  the  true  disturbances: 

var  uf  «   j2  var  ux  +  2jk  cov  (uhu2)  +  k2  var  w2 

var  uf  =  m2  var  U\  -f  2mn  cov  (1*1,^2)  +  n2  var  u2    (6-6) 

cov  (wf^f )  =  jm  var  Ui  -f*  (jn  +  m/c)  cov  (1*1,^2)  +  ^n  var  w2 

6.6.  Three  definitions  of  exact  identification 

The  discussion  that  follows  is  meant  to  apply  to  linear  models  only. 
Some  results  can  be  extended  to  other  types  of  models  (but  not  in  this 
work). 

A  model  or  an  equation  in  it  may  bo  cither  (exactly)  identified,  or 
underidcntificd,  or  over-identified.  Setting  aside  for  the  moment  the 
last  two  cases,  here  are  three  alternative  definitions  of  exact  identifica- 
tion, one  in  terms  of  the  statistical  appearance  of  the  model,  ~)  in 
terms  of  maxima  of  the  likelihood  function  L,  and  one  in  terms  of  the 
probability  distribution  of  the  endogenous  variables. 

Definition  1.  A  model  is  identified  if  its  structural  equations  "look 
different"  from  the  statistical  point  of  view.  An  equation  looks 
different  if  linear  combinations  of  the  other  equations  in  the  system 
cannot  produce  an  equation  involving  exactly  the  same  variables  as  the 
equation  in  question. 

Thus  the  supply-demand  model  (6-5)  is  not  exactly  identified, 
because  both  equations  contain  the  same  variables,  price  and  quantity. 

In  the  model 

SS  2/1  +  /3i2?/2  +  7ii*i  *=  ui 

DD        /?2i</i+      2/2  =w2  {p'° 

where  t\  represents  rainfall,  a  linear  combination  of  SS  and  DD 
contains  the  same  variables  as  SS  itself.  Not  so  for  DD,  because 
every  nontrivial  linear  combination  introduces  rainfall  into  the  demand 
equation.  In  this  model  the  demand  equation  is  identified,  but  the 
supply  is  not  exactly  identified.  In  such  cases  the  model  is  not  exactly 
identified. 

Definition  2.  A  model  is  identified  if  the  likelihood  function  L(S) 
has  a  unique  maximum  at  a  "point"  A  =  A0.  This  means  that,  if  you 
substitute  the  values  «°  in  L,  L  is  maximal;  at  any  other  point  L  is 
definitely  smaller.  Similarly,  an  equation  is  exactly  identified  if  the 
likelihood  function  L  becomes  smaller  when  you  replace  the  set  a£  of 
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that  equation's  parameters  by  any  other  set  of  «J.    This  way  of  teking 
at  the  matter  is  presented  in  detail  later  on,  in  Sec.  6.12. 

Definition  3.  Anything  (a  model,  an  equation,  a  parameter)  is 
called  exactly  identified  if  it  can  be  determined  from  knowledge  of  the 
conditional  distribution  of  the  endogenous  variables,  given  th©  exoge- 
nous. This  is  to  say,  it  is  identified  if,  given  a  sample  that  wag  large 
enough  and  rich  enough,  you  could  determine  the  parameters  in 
question.  We  know  that,  no  matter  how  large  the  sample  or  how 
rich,  we  could  never  disentangle  the  two  equations  of  (6-5). 

All  three  definitions  appear  to  say  that  exact  identification  is  not  & 
stochastic  property,  for  it  does  not  seem  to  depend  on  the  samples  w© 
may  chance  to  draw.     We  shall  return  to  this  question  later  on. 

One  must  be  very  accurate  and  careful  about  the  terminology. 
Over-,  under-,  and  exact  identification  are  exhaustive  and  mutually 
exclusive  cases.  Identified  means  "either  exactly  or  o  vender*  klfkd." 
Not  identified  means  "underidentified." 

Underidentification  occurs  when: 

By  linear  combinations  of  the  equations  one  can  obtain  a  bogus  equa- 
tion that  looks  statistically  like  some  true  equation  (Definition  1). 

The  likelihood  function  has  a  maximum  maximorum  at  two  or  more 
points  of  the  parameter  space  (Definition  2). 

Knowledge  of  the  conditional  distribution  of  the  endogenous  variables, 
given  the  exogenous,  does  not  determine  all  the  parameters  of  th© 
model  (Definition  3). 

There  are  three  principal  ways  to  avert  (or  at  least  to  detect) 
absence  of  exact  identification:  (1)  constraints  on  the  a  priori  values  of 
the  parameters;  (2)  constraints  on  the  estimates  of  the  parameters; 
(3)  constraints  on  the  stochastic  assumptions  of  the  model. 

6.7.  A  priori  constraints  on  the  parameters 

Two  new  symbols  will  speed  up  the  discussion  considerably.  Sup- 
pose we  are  discussing  the  third  equation  of  a  model.  A  single  asterisk 
will  denote  the  variables  present  in  the  third  equation,  a  double  asterisk, 
those  absent  from  the  third  equation.    Asterisks  can  be  attached  to 
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variables,  to  their  parameters,  or  to  vectors  of  such  variables  and 
parameters.  The  asterisk  notation  has  now  become  standard  in 
econometric  literature,  and  Appendix  F  gives  a  detailed  account  of  it. 

The  commonest  a  priori  restrictions  on  A  are  (1)  zero  restrictions, 
like  724  =  0;  (2)  parameter  equalities  in  the  same  equation,  for 
example,  721  =  722;  (3)  other  equations  involving  parameters  of 
several  equations  a  'priori. 

These  cases  have  economic  counterparts,  which  I  proceed  to 
illustrate. 

Zero  restrictions 

Zero  restrictions  are  common  and  handy.  A  zero  restriction  says 
that,  for  all  we  know>  such  and  such  a  variable  is  irrelevant  to  the 
behavior  of  a  given  sector.  If  nothing  but  zero  restrictions  are 
contemplated,  then  we  have  a  handy  counting  rule  (Counting  Rule  1) 
for  telling  whether  an  equation  is  identified. 

If  an  equation  of  a  model  contains  all  the  variables  of  the  model,  it  is 
underidentified,  because  linear  combinations  of  all  the  equations  look 
statistically  just  like  it.  To  avoid  this  underidentification,  the  follow- 
ing two  conditions  are  necessary: 

1.  That  some  variables  (call  them  #**)  be  absent  from  this  equation. 

2.  That  the  variables  (call  them  x*)  present  in  the  equation  in 
question,  whenever  they  appear  in  another  equation,  be  mixed  with  at 
least  one  £**, 

In  (6-1)  the  first  equation  is  identified,  because  any  intermixture  of 
the  second  equation  brings  in  variables  y**  and  y**  (double-starred 
from  the  point  of  view  of  the  first  equation),  and  intermixture  of  the 
third  equation  brings  in  y**t  which  is  absent  from  the  first  equation. 
In  (6-1)  the  second  equation  is  underidentified,  because  the  third 
equation  can  be  merged  into  it  without  bringing  in  any  variable  that 
is  not  already  in  the  pure,  uncontaminated  second  equation.  Finally, 
the  third  equation  is  identified,  because  an  intermixture  of  the  first 
equation  introduces  zj*  and  zf*,  and  intermixture  of  the  second 
equation  introduces  ?/**  and  z**  (the  double  stars  are  now  from  the 
point  of  view  of  the  third  equation). 

This  example  shows  that  underidentification  can  be  detected  by* 
checking  whether  given  strategic  parameters  in  the  model  are  specified 
a  priori  to  be  zero  or  nonzero.     This  justifies  the  following  statement: 
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Counting  Rule  1.  For  an  equation  to  be  exactly  identified  it  is 
necessary  (but  not  sufficient)  that  the  number  of  variables  absent 
from  it  be  one  less  than  the  number  of  sectors. 

Thus,  if  G*  and  H*  are,  respectively,  the  number  of  endogenous  and 
exogenous  variables  present  in  the  gth  equation,  then  for  the  gth 
equation  to  be  identified  it  is  necessary  (but  not  sufficient)  that 
G  +  H-Q3*  +  Hj)  =  G  -  1,  or  that  H  -  H*  «  G*  -  1. 

Parameter  equalities  in  the  same  equation 

Another  quite  common  a  priori  restriction  is  to  set  two  or  more 
parameters  of  a  given  equation  a  priori  equal.  For  instance,  let  us 
interpret  (6-1)  as  a  model  of  bank  behavior,  where  Z\  represents 
balances  of  banks  at  the  Federal  Reserve  and  Zi  represents  balances  of 
banks  at  other  banks.  It  is  conceivable  that  a  commercial  bank  may 
conduct  its  loan  policy  by  looking  at  its  total  balances  and  not  at 
whether  they  are  held  at  the  Federal  Reserve  or  at  another  bank. 
The  restriction  would  be  expressed  721  ~  722.  On  the  other  hand, 
some  other  sector,  say,  foreign  banks,  may  treat  the  two  kinds  of 
balances  differently,  732  5^  731.  Under  these  conditions,  if  the  third 
equation  is  intermixed  with  the  second  equation,  the  result  cannot 
masquerade  as  the  second  equation,  because  the  bogus  second  equation 
would  have  different  coefficients  for  z\  and  z2,  contrary  to  the  a  priori 
assumption  that  the  response  to  all  balances  (Federal  Reserve  and 
other)  is  identical. 

Linear  equations  connecting  the  parameters  of  different  equations 

Suppose  that  a  model  contains  a  production  function  and  an  equa~ 
tion  showing  the  distribution  of  national  income  by  factor  shares. 
Then  the  coefficient  of  the  share  of  labor  is  a  priori  equal  to  the  labor 
coefficient  of  the  production  function,  on  the  grounds  of  the  marginal 
productivity  theory  of  wages. 

Collectively,  all  the  linear  restrictions  on  A  discussed  so  far  can  be 
capsuled  into  Counting  Rule  2.  Let  A**  be  what  is  left  of  A  if  we  throw 
out  the  columns  corresponding  to  the  variables  present  in  the  gth 
equation. 

Counting  Rule  2.  For  the  gth  equation  to  be  exactly  identified  it  is 
necessary  and  sufficient  that  the  matrix  A**  have  rank  G  —  1. 
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These  tests  and  counting  rules  can  (and  should)  be  applied  before 
you  start  computations. 

There  are  no  convenient  counting  rules  for  nonlinear  restrictions  on 
the  parameters. 

Inequalities  such  as  a  >  0  or  a  >  0  do  not  help  to  remove  under- 
identification.  For  instance,  knowledge  that  demand  is  downward- 
sloping  and  that  supply  is  upward-sloping  does  not  help  to  identify 
the  model  (6-5). 

6.8.  Constraints  on  parameter  estimates 

Consider  again  the  supply-demand  model  (6-7).  It  states  a  priori 
that  rainfall  influences  supply  and  not  demand;  and  this  restriction 
identifies  the  demand  equation  (but  not  the  supply).  Now  imagine 
that  you  draw  an  unlucky  sample  made  up  of  cases  where  the  other 
random  elements  U\  have  annihilated  the  theoretical  effect  of  rainfall. 
You  will  get  y ii  (for  this  sample)  equal  to  zero.  The  sample  has 
behaved  as  if  rainfall  did  not  influence  supply,  i.e.,  as  if  the  model  were 
reduced  to  (6-5),  where  the  demand  was  statistically  indistinguishable 
from  the  supply. 

The  moral  of  this  is:  If  you  are  not  a  priori  certain  that  supply  is 
influenced  by  rainfall  (not  only  theoretically  but  also  in  the  sample 
period)  then  do  not  proceed  with  the  estimation  of  demand.  If  you 
fear  that  rainfall  fails  to  affect  supply  (whether  in  the  sample  or 
generally),  then  to  estimate  the  demand  introduce  in  the  supply 
function  another  variable  z2  (say,  last  year's  price  of  a  competing  crop) 
that  you  are  certain  influences  (if  ever  so  little)  this  year's  supply  both 
theoretically  and  in  the  sample  period.     The  new  model  then  is 

SS  V\  -f  0122/2  +  TllZl  +  712^2  «  wi  („  g. 

DD         foit/i  +2/2  =  u2 

and  so  last  year's  price  takes  on  the  burden  that  rainfall  is  supposed  to 
carry  in  making  the  demand  identifiable. 

A  very  neat  extension  of  Counting  Rule  2  covers  all  these  require- 
ments: For  exact  identification ,  the  ranks  of  A**  and  A**  must  equal 
G  -  1. 

We  can  show  this  in  a  third  way.1    If,  in  the  original  model  (6-7), 
1  With  acknowledgments  to  T.  C.  Koopmans,  in  Hood,  pp.  31-32. 


6.9.   CONSTRAINTS  ON  THE  STOCHASTIC  ASSUMPTIONS  95 

Yn  is  truly  nonzero,  then  it  is  impossible  to  construct  a  bogus  demand 
equation  without  detecting  it.    Take  as  the  bogus  demand 

DD®  =  %DD  +  %SS 

Then  the  bogus  random  term  of  demand  is 

©       2ui  +  u2      7n  . 

Then  cov  (uf,z\)  is  not  zero,  and  will  show  up  in  estimation,  unless 
the  sample  is  the  unlucky  one  in  which  rainfall  is  neutralized  (f  n  ~  0) 
by  the  random  factor.  If,  upon  completing  the  estimation,  we 
discover  that  mj,.,,  is  quite  different  from  zero,  then  we  can  detect 
underidentification  but  we  cannot  remove  it.  On  the  other  hand,  the 
discovery  that  w;,.,,  is  nearly  zero  is  no  guarantee  that  we  have 
identified  demand  if  there  is  a  strong  reason  to  suspect  that  supply  is 
unaffected  by  rainfall. 

6.9.  Constraints  on  the  stochastic  assumptions 

Let  the  random  terms  of  (6-5)  satisfy  Simplifying  Assumptions 
1  to  7,  so  that  cov  (uifu2)  =  0.  Will  this  help  to  identify  the  supply 
SS?  Sometimes.  Suppose  that  we  knew  beforehand  that  var  u\9 
cov  (ui,u2),  var  u2  were  of  the  orders  of  magnitude  3,  0,  10,  respec- 
tively. The  "deception"  can  be  detected  from  (6-6)  if  2(jB? )2,  which 
is  the  estimate  of  var  uf,  is  very  different  from  3.  This  can  have 
happened  by  chance  in  the  sample  used,  but  it  becomes  more  and 
more  unlikely  the  more  2(j2?)2  differs  from  3.  On  the  other  hand, 
the  bogus  variances  and  covariances  may  have  nothing  peculiar  about 
them — indeed  they  may  equal  3,  0,  and  10,  respectively,  because  of  a 
special  set  of  values  that  j,  k,  m,  and  n  have  taken  on,  for  example, 
J  —  %>  -h  =  }4>  m  ■■  —  M>  n  —  K-  Therefore,  in  general,  there  is 
no  guarantee  that  SS®  will  look  statistically  different  from  SS,  even 
if  we  have  complete  knowledge  of  the  underlying  covariances  of  the 
random  term. 

Another  way  to  impose  identification  on  a  model  is  to  say  something 
specific  about  the  variances  of  the  random  terms.    This  was  done  by 


96  IDENTIFICATION 

Schultz  in  some  early  studies  of  agricultural  markets.1  In  some  of 
Schultz's  work,  both  supply  and  demand  are  functions  of  the  same  two 
endogenous  variables  (price  and  quantity)  and  of  random  shocks. 
However,  supply  is  more  random  than  demand.  Then  the  scatter  of 
observed  points  will  be  more  in  agreement  with  the  demand  than  with 
the  supply  function.  Ambiguity  is  not  eliminated  entirely,  but  it  is 
reduced  as  the  randomness  of  supply  increases  relative  to  the  random- 
ness of  demand.  In  the  notation  of  (G-6),  the  restriction  takes  the 
form  var  U\  ■  q  var  w2;  and  identification  improves  with  increase  in  q. 

More  complex  restriction  of  this  kind  could  also  help. 

To  summarize  the  results  of  Sees.  6.7  to  G.9: 

1.  Identification  can  be  checked  before  computing  by  use  of  the 
Counting  Rules  as  applied  to  A. 

2.  If  you  fear  that  art  equation  is  underidentified  because  you  are  not 
sure  whether  a  given  variable  x  reacts  significantly,  estimate  the  equa- 
tion anyhow  and  then  check  whether  the  covariance  of  x  with  the 
residual  %  is  near  zero;  if  not,  you  may  have  identified  the  gth  equation. 
If  mx.ua  is  near  zero,  you  have  not  identified  your  equation.  If  the 
numerically  largest  determinant  of  rank  G  —  1  from  A**  is  close  to  zero, 
X  probably  did  not  play  a  significant  role. 

3.  There  are  tests  that  help  detect  underidentification. 

4.  It  is  sometimes  possible  to  remove  underidentification. 

6.10.  Identifiable  parameters  in  an  underidentified 
equation 

When  an  equation  is  underidentified,  is  it  perhaps  possible  to 
identify  one  or  more  of  its  parameters,  though  not  all?  For  instance, 
what  about  the  identifiability  of  7n  in  (6-7)  ?  Intuition  says  that  711 
cannot  be  adulterated  by  linear  combinations  of  DD,  since  Z\  occurs 
only  in  the  supply  SS,  Intuition  is  wrong  if  it  concludes  that  this 
fact  makes  711  identifiable.    Applying  (6-4),  we  havo 

—  7ll   =   7Tll  -f*  0127T21 

The  TS  can  be  computed  from  the  reduced  form 

1  Henry  Schultz,  The  Theory  and  Measurement  of  Demand,  pp.  72-81  (Univer- 
sity of  Chicago  Press,  Chicago:  1938). 
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2/i  -»  vnZi  +  Vi 

Vl  «■  7T2121  +  Vf 

but  /3i2  is  and  remains  unknown  and,  therefore,  so  does  ?n. 

So,  contrary  to  intuition,  the  fact  that  a  given  variable  enters  one 
equation  of  a  model  and  no  others  does  not  make  its  coefficient  identi- 
fiable. Underidentification  is  a  disease  affecting  all  parameters  of 
the  affected  equation.  For,  if  the  gth.  equation  is  unidentified,  this 
means  that  there  are  fewer  equations  than  unknowns  in  the  gth  row  of 
formula  (6-4).  All  coefficients  of  the  ^th  equation  enter  (6-4)  sym- 
metrically, and  so  none  can  have  a  privileged  position  over  the  others. 

Let  us  now  ask  whether  we  can  identify,  in  an  otherwise  unidenti- 
fiable equation,  the  ratio  of  two  unidentifiable  coefficients.  In  special 
cases  it  may  be  both  important  and  sufficient  to  know  the  relative 
rather  than  the  absolute  impact  of  two  kinds  of  variables.  Let  us 
consider  (6-8)  as  a  model  of  the  supply  and  demand  for  loans,  where 
?/i  is  quantity  of  loans,  7/2  is  interest  rate,  Z\  is  balances  at  the  Federal 
Reserve,  and  zi  is  balances  at  foreign  banks.  We  are  curious  to  know 
whether  the  two  kinds  of  bank  balances  differ  in  their  effects  on  the 
loan  policy  of  a  commercial  bank.  Is  it  possible  to  identify  t\\/y\%! 
No,  because  (6-4)  applied  to  this  model  yields 

""Til    =   Til   4"  012^21 
—  712    =    7Ti2  +  0127T22 

which  cannot  be  solved  for  711/712  so  long  as  0i2  is  unidentified.     The 
most  we  can  get  is  the  relation 

7u  4-  frn  __  Y12  -f-  ^12 

7T21  7T22 

which  is  a  straight  line  in  the  711,712  space,  giving  an  infinity  of  pairs 
(7n,  Y12). 

Exercises 

6.B    Derive  explicitly  the  equations  —  r  =  Bn  for  (6-8). 

6.C  In  the  above  exercise,  compute  the  two  values  of  0n  in  terms 
of  the  coefficients  of  the  reduced  form.  Under  what  arithmetical 
conditions  would  they  be  identical?    Interpret  this  in  economic  terms. 
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6.11.  Source  of  ambiguity  in  overidentified  models 

Let  us  return  to  (6-S),  rewriting  it  for  convenience 

SS  q  +  Pup  +  ynr  +  7nC  -  ux  (Q  Q. 

DD         0nq  +       V  -  w2  K~  } 

where  q  ,«■  quantity,  p  ■*  price,  r  =  rainfall,  c  —  last  year's  price  of  a 
competing  crop.  Supply  is  underidentified,  and  demand  is  over- 
identified.  For  the  latter  we  get  from  the  reduced  form  two  incom- 
patible estimates  of  the  single  unknown  /32iJ 

a'       —     ~ *21  off     _     ~^22 

P21  — .  "^ — ■        P21 5 — 

7Tn  7T12 

But  why  should  the  reduced  form,  if  estimated  by  least  squares,  give 
two  values  for  02i>  the  price  elasticity  of  demand?  The  answer  is  in 
terms  of  the  wobblings  of  the  supply  function.  In  (6-9),  supply 
wobbles  in  response  to  random  shocks  u\  and  to  two  unrelated  exoge- 
nous variables,  this  year's  rainfall  r  and  last  year's  price  c  of  a  compet- 
ing crop.  In  Fig.  10a  I  have  drawn  some  supply  curves  corresponding 
to  different  amounts  of  rainfall  (+1,  —1)  for  a  fixed  value  of  c  (=  0). 
Observable  points  fall  in  the  parallelogram  ABCD.  On  the  other 
hand,  in  Fig.  106  the  variations  in  supply  come  not  from  rainfall  (which 
is  held  constant  at  0)  but  from  last  year's  price  only.  Observations 
fall  in  the  parallelogram  EFGH.    The  first  estimate  of  (3 

_ftl  .*!!..  *te*±*  (6.10) 

#11  ™>(q.c)-(r,c) 

corresponds  to  the  broken  line  in  Fig.  10a,  because  #21/^11  correlates 
price  and  quantity  reactions  as  they  result  from  variations  in  rainfall 
only.    The  other  estimate 

off    __  #22         m(r,p).(r,c)  /n  in 

—P21  —  r~      — \p-Li) 

#12  M>(r,q)'(r,c) 

corresponds  to  the  broken  line  in  Fig.  106,  because  it  correlates  p  to  q 
as  a  result  of  variations  in  last  year's  price  alone.1  The  sample  must 
be  very  peculiar  indeed  that  yields  equal  estimates  ffn  and  £&. 

1  In  expressions  like  (6-10)  and  (6-11),  the  heuristic  device  of  canceling  the 
"factors"  c  and  r  in  numerator  and  denominator  gives  a  correct  interpretation 
of  what  is  being  correlated,  provided  that  these  "factors"  appear  on  both  sides 
of  both  dots. 


6.11.  SOUBCE  OF  AMBIGUITY  IN  OVERIDENTIFIED  MODELS 
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The  explanation,  then,  is  at  bottom  simple:  When  demand  is 
overidentified,  this  means  that  both  rainfall  r  and  lagged  prim  c  rrmke 
the  supply  shift  up  and  down,  and  trace  the  demand  relationship  for  us. 
The  original  form  of  the  model  shows  this.  The  reduced  form,  how- 
ever, does  not  allow  us  to  trace  the  demand  uniquely,  as  the  result  of 
the  combined  effect  of  rainfall  and  lagged  price.  Rather,  the  reduced 
form  gives  us  a  choice  of  estimating  the  slope  of  the  demand  equation 
either  as  a  result  of  rainfall-induced  variations  in  supply  or  m  a  result 
of  lagged-price-induced  variations  in  supply.  Essentially,  then,  either 
alternative  leaves  out  some  crucial  consideration,  namely,  the  fact  that 


sA 

^^^V 

■>JK1r 

^^^ 

J^^Bj^ 

yPS" 

*X^<_^  Slope  % 

—  Mg 

(a)  (ft> 

Fig.  10.  Ambiguity  in  an  overidentified  equation. 

the  omitted  variable  (lagged  price  and  rainfall,  respectively)  also 
affects  the  price  and  quantity  combinations  that  the  sample  shows. 
To  show  that  (6-10)  is  a  biased  estimate,  write  p  =  u%  —  @n$>    Then 

W(p,c).(r,c)   =   m>(ut,c)>(r,c)  ~  p21™>(q,ch(r,c),  and  SO 


-"•021   =    ~021  + 


m(Ut,eHr,e) 
W(,,c).(r,e) 


The  expected  value  of  the  bias  term  is  not  zero.  This  is  easily  seen 
from  (6-9).  Let  r,  c,  and  wi  be  fix^d,  and  let  u*  take  on  a  set  of  con- 
jugate values  4-W2  and  —  u2  ?*  0.  Then,  in  (6-9),  q  necessarily  takes  on 
two  different  values  q'  and  q",  and  thus  the  above  denominator  changes 
as  ui  takes  on  its  conjugate  values.  Therefore,  m(+u„eMr,0)/ftl(c',«Kr.«) 
and  W(_Ul,c).(ri<j)/m(a'\C).(r,C)  do  not  add  up  to  zero.    To  show  that 
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fi'n  is  a  consistent  estimate,  consider  that  flim  m<tt|,C).(r.e)  ■=•  0  but 
Plim  m(fllC).(r,0)  ?*  0. 

Exercises 

6.D  If  it  turns  out  that  #'21  =  /5ji,  the  sample  moments  must 
satisfy  either  the  equation  mrrmcc  =  mremre  or  the  equation 
mpcmqr  =  mprmqe.  The  first  of  these  declares  that  rainfall  and  last 
year's  price  are  perfectly  correlated  in  the  sample.  Interpret  the 
second  one.     Hint:  Use  the  fact  that  p  =  u2  —  fog. 

6.E  If  least  squares  are  applied  to  the  reduced  form,  obtaining 
tFs,  prove  the  following:  (1)  that  all  parameters  £,  y  that  can  be  esti- 
mated by  working  back  from  the  reduced  form  are  consistent;  (2)  that 
all  7s  that  can  be  estimated  (whether  uniquely  or  ambiguously)  are 
in  general  biased;  (3)  that  all  /3s  that  can  be  estimated  (whether 
uniquely  or  ambiguously)  are  in  general  biased. 

In  the  following  exercises,  p  (price)  and  q  (quantity)  are  the  endoge- 
nous variables.  The  exogenous  variables  are  i  (interest  rate),  /  (liquid 
funds),  and  r  (rainfall). 

6.F    Show  that  in 

SS  q  4-  pup  =  u        (exactly  identified) 

DD        02i#  4-      p  +  yni  *=  v         (underidentified) 

only  0i2  can  be  estimated  unambiguously. 
6.G    From  the  model 

SS  q  -f  Pup  4-  7nr  =  u        (overidentified) 

DD        j52i?  4-      p  4-  72i«  +  722/  =  v         (exactly  identified) 

the  reduced  form  leads  to  the  following  estimates  of  £12: 
at   '     *n  oft  _  *« 

#21  7T22 

where  § u  ■■  m(7,/,r).(v,/,r)/m(.-,/,r).(t-,/,r)and7F2i  =  wi(P,/,r).(»,,/,r)/w«,/,r).(»,,/.r). 
Show  that  these  estimates  are  biased  and  consistent. 
6. II    In  Exercise  6.G,  find  the  bias  (if  any)  for  ^13,  foi,  ^22. 

6.12.  Identification  and  the  parameter  space 

The  likelihood  function  L(S)  may  or  may  not  have  a  unique  highest 
maximum  as  a  function  of  the  parameter  estimates  A.  If  it  does,  the 
model  is  (exactly  or  over-)  identified. 


Fig.   11.   Maxima  of  the  likelihood  function,    a.   Underidentificdj  b,  exactly 
identified;  c.  overidentified. 
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Along  the  axes,  labeled  a  and  0  in  Fig.  11,  let  me  represent  the 
parameter  space.  Usually  this  space  has  mdre  dimensions,  but  I 
cannot  picture  these  on  flat  paper. 

Underidentification  is  pictured  in  Fig.  11a.  Here  the  mountain  has 
either  a  flat  top  T  or  a  ridge  RRf,  or  both.  Its  elevation  is  highest  in 
many  places  rather  than  at  a  single  place;  i.e.,  there  are  many  local 
maxima.  This  means  that  several  values  of  a  and  /3  are  candidates  for 
the  role  of  estimates  of  the  true  a  and  /3.  In  the  picture  these  can- 
didates lie  in  the  cobra-like  area  PP'Q  that  creeps  on  the  floor. 

When  the  system  is  (exactly  or  over-)  identified,  nothing  of  this  sort 
happens.  The  mountain  has  a  single  highest  point.  If  the  system  is 
exactly  identified,  this  fact  is  the  end  of  the  story,  and  Fig.  116  applies. 
When  the  system  is  overidentified,  then  Fig.  1  lc  applies.  The  moun- 
tain in  Fig.  1  lc  is  the  same  as  in  Fig.  116,  but  we  have  several  conflicting 
ways  to  look  for  the  top.  One  estimating  procedure  allows  us  to  look 
for  the  highest  point  of  the  mountain  along,  say,  the  38th  parallel; 
another  equally  admissible  procedure  tells  us  to  look  for  it  not 
along  the  38th  parallel  but  along  the  boundary  XY  between  area 
I  and  area  II.  Accordingly,  we  get  P'  and  P",  two  estimates  of  P  that 
correspond  to  j3'21  and  flj'i  of  equations  (6-10)  and  (6-11). 


6.13,  Over-  and  underidentification  contrasted 

The  example  in  the  figure  suggests  that  overidentification  and 
underidentification  are  not  simple  logical  opposites,  except  in  a  very 
trivial  sense — in  relation  to  the  Counting  Rules.  Table  6. 1  gives  the 
contrasts  among  over-,  exact,  and  underidentification. 

We  say  that  underidentification  is  not  usually  a  stochastic  property, 
because  it  arises  from  the  a  priori  specification  of  the  model  and  not 
from  sampling,  and  so  it  cannot  be  removed  by  better  sampling. 
Stochastic  underidentification  is  in  the  nature  of  a  freak;  it  was 
illustrated  in  Sec.  6.8.  On  the  other  hand,  overidentification  is  a 
stochastic  property  that  arises  because  we  disregard  some  information 
contained  in  the  sample.  Overidentification  is  removed  if  all  the 
information  of  the  sample  is  utilized — which  means  that  reduced-form 
least-squares  estimation  must  be  abandoned. 


6.14.  CONFLUENCE 
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Table  6.1 
Degree  of  identification 

Underidentifi- 

Exact 

OvERIDENTIF!- 

CATIDN 

IDENTIFICATION 

CATION 

Unique  maximum  of 
the  likelihood  func- 
tion 

Does  not  exist 

Exists 

Exists 

A  -priori  restrictions  for 
locating  single  high- 
est point 

Not  enough 

Enough 

Too  many 

Ambiguity,  if  any, 
introduced  because: 

You  have  not 
enough  inde- 
pendent varia- 
tion in  supply 
and  demand 

No  ambiguity 

In  reduced  form 
you  disregard  one 
or  another  eauae 
in  the  variation 
of  supply 

Estimate  of  tho 
parameters  if 
based  on  re- 
duced form 

/3s 

Biased,  con- 
sistent j 

Biased,  con- 
sistent 

Biased,  consistent 

78 

Biased,*  con- 
sistent. 

Biased,*  con- 
sistent 

Biased,*  consistent 

Is  the  degree  of  identi- 
fication a  stochastic 
property? 

Not  usually;  yes, 
if  in  fact  a 
variable  fails 
to  vary 

Yes 

In  special  cases,  unbiased. 


6.14.  Confluence 


Multicollinearity  and  underidentification  are  two  special  cases  of  a 
mathematical  property  called,  confluence. 

Multicollinearity  arises  when  you  cannot  separate  the  effects  of  two 
(or  more)  theoretically  independent  variables  because  in  your  sample 
they  happen  to  have  moved  together.  This  topic  is  taken  up  again 
in  Chap.  9. 

To  show  the  connection  between  underidentification  and  multi- 
collinearity, I  shall  use  a  model  adapted  from  Tintner,  p.  33,  which 
contains  both. 


104 


IDENTIFICATION 


Suppose  that  the  world  price  of  cotton  is  established  by  supply  and 
demand  conditions  in  the  United  States.  Let  the  supply  of  American 
cotton  q  depend  only  on  its  world  price  p,  while  the  American  demand 
for  cotton  depends  both  on  its  price  and  on  national  income  y. 


DD        q  -  ap  +  #?/  -f  u 
SS         q-yp  +  v 


(6-12) 


How,  demand  in  this  model  is  underidentified.     If  the  sample  comes 


Fig.  12.  Confluence. 

from  earlier  years,  when  cotton  was  king,  then  the  model,  in  addition, 
suffers  from  multicollinearity,  because  the  national  income  was  strongly 
correlated  with  the  price  and  quantity  of  cotton.  In  the  parameter 
space  for  a,  £,  and  7,  the  likelihood  function  L  has  a  stationary  value  over 
a  region  of  the  space  afiy.  To  picture  this  (Fig.  12)  let  us  forget  7 — or 
assume  that  someone  has  disclosed  it  to  be  -f-0.03.  The  true  value  of 
the  parameters  is  the  point  (a,/?,0.03).  The  ambiguous  area,  over 
which  L  has  a  flat  top,  is  the  band  PQRS  in  Fig.  12.  If  a  sample  is 
taken  from  more  recent  years,  multicollinearity  is  reduced,  because 
national  income  y  and  the  world  price  of  cotton  p  are  no  longer  so 
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strongly  correlated  as  before.  In  Fig.  12,  the  gradual  diversification  of 
America's  economy  would  appear  as  a  gradual  migration  of  the  points 
in  the  band  PQRS  toward  a  narrower  band  around  the  curve  MN.  If 
the  time  comes  when  cotton  becomes  quite  insignificant,  then  multi- 
collinearity  will  have  disappeared,  but  not  the  underidentiflcation. 
In  the  figure,  the  band  will  have  collapsed  to  the  curve  M 2V,  but  not 
to  a  single  point. 

Exercise 

6.1    Suggest  methods  for  removing  multicollinearity  and  discuss 
them. 

Digression  on  the  etymology  of  the  term 
"multicollinearity" 

Suppose,  for  purposes  of  illustration  only,  that  in  (6-12)  the 
true  values  of  the  parameters  are  simply  a  =  —  1,  0  =  1,  7  »  1. 
Also  suppose  that  national  income  and  the  price  of  cotton  are 
connected  by  the  exact  relation 

V  =  3p 

(The  exactness  is  for  illustrative  purposes  only.  What  follows  is 
also  true  for  the  stochastic  case  y  =  3p  -J-  w.)  Then  the  demand 
can  be  written 

DD        q  =  2p  +  u 

Now,  the  following  estimates  of  a  and  0  are  consistent  with 
all  observations: 

1.  The  true  values  a  =  —  1,  0  =  1 

2.  The  pair  of  values  a  ==  1,  0  =  }4 — because 

q  =  lp  +  %y  4-  u  =  2p  -f  u 

3.  The  pair  of  values  a  =  2,  j3  =  0;  and  an  infinity  of  other 
pairs,  which  can  be  represented  as  the  collinear  points  of  line  AB 
in  Fig.  13 

If  we  take  a  bogus  demand  function,  then  its  parameters  a©, 
/3®  also  form  a  collinear  set  of  points — like  line  CD — that  agrees 
with  the  sample. 

Removing  multicollinearity  causes  EF  to  collapse  into  M ,  AB 
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into  N,  and  CD  intoP;  that  is  to  say,  removing  multicollinearity 
collapses  the  band  between  the  lines  EF  and  CD  into  the  line 
MNP.  On  the  other  hand,  removing  underidentification  col- 
lapses the  same  band  into  the  line  AB. 

Nomu!tfcol!inearity^v    P 
C 


Fig.  13.  Multicollinearity. 


Further  readings 

Koopmans,  chap.  17,  extends  and  refines  the  concept  of  a  complete  model. 

The  excellent  introduction  to  identification,  also  by  Koopmans,  was  cited 
in  Sec.  6.1.  Klein  gives  scattered  examples  with  discussion  (consult  his 
index). 

It  is  worthwhile  to  read  the  seminal  article  of  Elmer  J.  Working,  "  What  Do 
Statistical  'Demand  Curves'  Show?"  (Quarterly  Journal  of  Economics ,  vol.  41, 
no.  2,  pp.  212-235,  February,  1927),  both  for  its  contents  and  in  order  to 
appreciate  how  far  econometrics  has  progressed  since  that  time. 

Trygve  Haavclmo's  treatment  of  confluence,  in  "The  Probability  Approach 
in  Econometrics"  (Econometrica,  vol.  12,  Supplement,  1944),  is  hard  going, 
but  an  excellent  exercise  in  decoding.  If  you  have  come  this  far,  you  can 
tackle  this  piece. 

Those  who  appreciate  the  refinement  and  proliferation  of  concepts  and 
are  not  afraid  of  flights  into  abstraction  may  glance  at  Leonid  Hurwicz, 
"Generalization  of  the  Concept  of  Identification,"  chap.  4  of  Koopmans. 

Herbert  Simon,  "Causal  Order  and  Identifiability,"  in  Hood,  chap.  3, 
shows  how  an  econometric  model  can  be  analyzed  into  a  hierarchy  of  sub- 
models increasingly  more  endogenous  and  how  the  hierarchy  accords  with 
the  statistical  notion  of  causation  and  that  of  identifiability. 


CHAPTER  7 

Instrumental  variables 


7.1.  Terminology  and  results 

The  term  instrumental  variable  in  econometrics  has  two  entirely 

unrelated  meanings: 

1.  A  variable  that  can  be  manipulated  at  will  by  a  policy  maker  as  a 
tool  or  instrument  of  policy;  for  instance,  taxes,  the  quantity  of  money, 
the  rediscount  rate  j 

2.  A  variable,  exogenous  to  the  economy,  significant,  not  entering 
the  particular  equation  or  equations  we  want  to  estimate,  nevertheless 
used  by  us  in  a  special  way  iiji  estimating  these  equations 

In  this  work  only  the  second  'meaning  is  used. 

This  chapter  explains  and  rationalizes  the  instrumental  variable 
technique.     It  shows:  j 

1.  That  the  technique,  though  at  first  sight  it  appears  odd,  is 
logically  similar  to  other  estimating  methods  and  also  quite  reasonable 

2.  That,  if  the  choice  of  instrumental  variables  is  unique,  the  model 
is  exactly  identified,  and  that  the  instrumental  variable  method  is 
equivalent  to  applying  least;  squares  to  the  reduced  form  and  then 
solving  back  to  the  original  form 
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To  understand  the  logic  of  instrumental  variables  we  must  first  take 
a  deep  look  at  parameter  estimation  in  general. 

7.2.  The  rationale  of  estimating  parametric  relationships 

Two  ideas  dominate  the  strategy  of  inferring  parametric  connections 
statistically.  The  first  idea  is  that  variables  can  be  divided  into 
causes  and  effects.  The  second  is  that  conflicting  observations  must 
be  weighted  somehow. 

Causes  and  effects 

There  can  be  one  or  more  causes,  symbolized  by  c,  and  one  or  more 
effects,  symbolized  by  c.  Various  "instances"  or  "degrees"  of  c  and  e 
will  carry  a  subscript.  A  parameter  is  nothing  more  than  the  change 
in  effect (s),  given  a  change  in  cause (s).  Symbolically  this  can  be 
represented  as  follows: 

~  .  change  in  effect  (s)  Ae       ,-  1N 

Parameter  = — — r ^ rr  =  t-       (7-1) 

corresponding  change  in  cause  (s)       Ac 

This  relation1  is  fundamental  in  Chaps.  7  and  8;  the  general  theme  of 
these  chapters  is  that  all  techniques  of  estimation  are  variations  and 
elaborations  of  (7-1). 

The  change  in  cause (s)  and  effect (s)  can  be  any  change  whatsoever. 
For  instance,  the  change  in  effect  e  may  be  e\  —  62,  or  €2  —  Ci,  or 
0i5  —  026?  in  general,  et  —  e*.  The  corresponding  changes  in  cause  c 
are  d  —  c2,  and  c2  —  Ci,  and  C15  —  025;  in  general,  ct  —  ce. 

Usually,  however,  the  change  is  computed  from  some  fixed  reference 
level  of  the  effect (s)  or  the  cause  (s) — and  this  fixed  level  is  most 
typically  the  mean.2  So  parameters  are  usually  computed  by  a 
formula  like  (7-2)  rather  than  (7-1). 

Parameter  estimate  = ■  =  ft  (7-2) 

Ct  —  c  N 

1  It  is  meant  to  be  a  conventional  relationship.  Only  in  the  simplest  linear 
systems  is  it  true  that  the  numerical  value  of  a  parameter  can  be  expressed  as 
simply  as  in  equation  (7-1). 

9  Ideally,  the  mean  of  the  population.  In  practice,  the  mean  of  the  sample. 
In  linear  models  the  distinction  is  immaterial  for  most  purposes. 
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This  is  merely  a  convenience,  which  does  not  affect  the  logic  of  par&me- 
ter  estimation.  Henceforth,  all  symbols  Ac,  Ac,  etf  ch  or  dimply  e 
and  c  represent  deviations  from  the  corresponding  mean. 

The  problem  of  conflicting  observations 

What  happens  when  two  or  more  applications  of  (7-2)  give  different 
values  for  x?  This  is  very  likely  to  happen  in  stochastic  models, 
because  in  such  models  the  effect  e  results  not  merely  from  the  explicit 
cause  c  but  also  from  the  disturbance  u.  Which  of  the  several  con- 
flicting values  should  be  assigned  to  the  unknown  parameter  *■? 
The  problem  arises  in  all  but  the  simplest  cases  of  parameter 
estimation.  I 

In  general,  the  parameter  estimate  is  a  weighted  average  of  quotients 
of  the  form  e/c.  Take  the  model  et  —  yct  4-  ut.  Then  the  weighted 
estimate  of  7  is  1 

7  =  —  wi  +  —  W2  +  •  •  •  H —  u>i 

C\  ,C2  c, 

Any  set  of  weights  Wi,  W2,  .  .  J  ,  w,  will  do,  provided  they  add  up  to  1. 
If  you  want  to  attach  much  significance  to  the  instance  ew/cw,  make 
wn  large;  if  you  want  to  disregard  the  instance  627/027,  make  Wn  m  0 
or  even  negative.  | 

Let  us  for  simplicity  restrict  ourselves  to  just  two  observations. 
One  of  the  many  possible  sets  of  weights  is  the  following: 


ilL       w  m  -A 

~r :  C2  C\  T*  Cj 


u>i  -  irxb       ^2  -  :txt«  C?-3) 


Then,  using  these  weights,  the  weighted  estimate  is  also  tha  familiar 
least  squares  estimate,  because 


Ci      c\       ,  62    j  c\  e\C\  +  C2C2  ___  tnw 

Ci  cf  +  c|       c2  c[  +  c§  "      c|  -f-  c|      ""  ?/l<* 


So  the  least  squares  estimate  amounts  to  nothing  more  than  a 
special  method  for  weighting;  conflicting  values  of  the  ratios  e/c. 
Now  two  questions  arise:  (1)  Why  should  the  weights  (7-3)  be  functions 
of  the  c's  or  have  anything  at  all  to  do  with  the  cause  c?  and  (2)  why 


M 

c2 

\o'\ 

c{ 

Vki 

log  c2 

S|c| 

Zc2 

2|c3| 

Xo* 

s  VR 

2  log  c2 

110  INSTRUMENTAL  VARIABLES 

should  we  pick  those  particular  formulas  and  not,  for  example,  absolute 
values 

Wl  =      N  „,2  =      N  (7^) 

|ci]  4-  \Ci\  \ci\  +  |cs|  v     ' 

or  cubes,  square  roots,  or  logarithms? 

The  answer  to  question  1  is  that  the  more  strongly  the  cause  departs 
from  its  average  level,  the  more  you  weight  it.  It  is  as  though  we  said 
that  the  real  test  of  the  relationship  et  =  yct  +  ut  is  whether  it  stands 
up  under  atypical,  nonaverage  conditions  (i.e.,  when  ct  is  far  from  c). 

But  now  why  should  one  [as  formulas  (7-3)  and  (7-4)  suggest]  give 
equal  weight  to  +ct  and  —  c<?  The  common-sense  rationale  here  is 
that  the  same  credence  should  be  given  to  a  ratio  e/c  when  c  is  atypically 
small  (relative  to  its  mean)  as  when  it  is  atypically  large.  This 
requirement  is  met  by  an  evenly  increasing  function  of  c,  for  instance: 

(7-5) 

The  answer  to  question  (2)  is  this:  From  the  many  alternative 
formulas  in  (7-5),  c2/2c2  is  selected  because  of  the  assumption  that 
ti  is  normally  distributed,  in  which  case  least  squares  approaches 
maximum  likelihood.  A  different  probability  distribution  of  the 
disturbances  would  prescribe  a  different  set  of  weights  for  averaging 
or  reconciling  conflicting  values  of  e/c. 

7.3.  A  single  instrumental  variable 

With  these  preliminaries  well  digested,  we  are  ready  to  understand 
the  logic  of  instrumental  variables.    Suppose  that 

Vt  -  fat  +  ut  (7-6) 

is  a  model  for  the  demand  for  sugar.  This  equation  is  known  to  have 
come  from  a  larger  system  in  which  p  (price)  and  q  (quantity)  are 
endogenous  and  a  great  many  other  causes  are  active.  We  call  (7-0) 
the  manifest  model  and  all  the  remaining  relationships  the  latent  model. 
Suppose  that  z  represents  some  exogenous  cause  affecting  the  economy, 
say,  the  tax  on  steel.  Now  in  an  interdependent  economic  system  the 
tax  on  steel  affects  theoretically  both  the  price  of  sugar  and  the  quantity 
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of  sugar  bought,  because  a  tax;  cannot  leave  unaffected  either  the  price 
or  the  quantity,  of  any  substitute  or  competitive  goods,  and  sugar  surely 
must  be  somewhere  at  the  end  of  the  chain  of  substitutes  or  com- 
plements of  steel.  ! 

The  method  of  instrumental  variables  says  that  you  ought  to 
compute  an  estimate  of  y  (from  two  observations  8  =  1, 2)  as  follows: 

(7-7) 


(7-8) 


y **  fir  Wl  +  t* w* 

$1                   Q2 

using  as  weights 

1 

9i*i  '■ 

w*  -        qiZ* 

W\   — 

9i* 

1  4"  9222 

W2                       . 

qiZi  +  qzZi 

so  that  the  estimate  of  y 

is     1 

r  _  Pi        9i*i 
qi  q\Z\  +  52*2 

+  P2 

92  \q\Z 

q&2            __  V\Z\  +  P2^2  t 

1  +  92*2       9i*i  +  92*2 

3  2*5 

Every  ounce  of  common  sen&e  in  you  ought  to  rear  itself  in  rebellion 
at  this  perpetration.  You  ought  to  protest,  saying:  "Nonsen.^e! 
My  boss  at  the  Zachary  Sugar  Refinery  will  fire  me  unceremoniously 
from  my  well-paid  and  respected  job  of  Company  Economotrician  if  I 
tell  the  Vice-President  that  I  multiply  the  price  and  quantity  of  sugar 
with  the  tax  on  steel  to  estimate  demand  for  sugar!  Better  give  me  a 
good  argument  for  this  act  of  alchemy.  Moreover,  did  it  ever  occur 
to  you  that  mzq  in  the  denominator  could  conceivably be  equal  to  zero-— 
and  certainly  is  in  some  samples?1  This  predicament  could  not  arise  in 
leastsquare  weights  like  (7-3). " 

I  hasten  to  reply  to  the  last  point  first.  The  possibility  that  mq, 
might  be  zero  is  the  reason  why  z  should  not  be  chosen  haphazardly  but 
rather  from  exogenous  variables  that  have  a  lot  to  do  with  the  quantity 
of  sugar  consumed.  So,  instead  of  the  tax  on  steel,  perhaps  we  ought 
to  take  the  tax  on  coffee,  honey,  or  sugar,  or  the  quantity  of  school 
lunches  financed  by  Congress.  Still,  it  is  possible  that,  in  the  sample 
chosen,  the  quantity  of  sugar  q  and  the  quantity  of  school  lunches  z 
happen  to  be  uncorrected,  and  it  is  true  that  this  sort  of  difficulty  m 
unheard  of  in  least  squares,  because  mqq  j  ust  cannot  be  zero :  the  quantity 
of  sugar  is  perfectly  correlated  with  itself. 

1  This  sample  would  be  a  set  of  measure  zero,  as  the  mathematicians  say* 
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Now  what  do  the  weights  (7-9)  say?  They  say  that,  the  more  sugar 
consumption  and  school  lunches  move  together,1  the  more  weight 
should  be  given  to  A-p/Aq,  the  price  effect  of  the  change  in  the  quantity 
consumed.  There  is  another  way  to  look  at  this  matter:  Write, 
purely  heuristically, 

Ap/Az 


8  -  ^  = 

Aq        Aq/Az 


(7-10) 


where  Ap/Az  is  symbolized  by  71  and  Aq/Az  by  72.     How  could  we 
possibly  interpret  71  and  72? 
Figure  14  represents  both  the  latent  and  the  manifest  parts  of  the 
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Fig.  14.  The  logic  of  the  instrumental  variable. 

model  (7-0).  The  solid  arrow  represents  the  manifest  mutual  causa- 
tion p  «-»  qf  which  appears  in  (7-6).  The  broken  arrows  represent  the 
latent  model,  which  is  not  spelled  out  but  which  states  that  z  affects 
p  and  q  through  the  workings  of  the  whole  economic  system.  So  the 
meaning  of  71  and  72  is  that  they  are  the  coefficients  of  the  latent  model. 
Since  z  is  exogenous,  why  not  estiniate  71  and  72  by  least  squares?  It 
sounds  highly  reasonable.     There  should  be  no  objection  to  this. 


?i  == 


mt 
mt 


Insert  (7-11)  in  (7-10),  and  then 


*  _  Ap/Az 
P       Aq/Az 


7i 

72 


ntzq 
72  =  — - 


mZp/mz 


mtp 
mzq 


mzq/mt 

which  justifies  the  instrumental  variable  formula  (7-9). 
1  In  a  given  instance,  not  over-all. 


(7-11) 


(7-12) 
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It  is  well  to  ask  now:  in  what  way  do  the  two  examples  p  ■  fiq  4-  u  . 
and  e  =  yc  -f*  w  differ?  Why  may  the  second  model  be  computed  by 
least  squares  although  the  first  has  to  be  estimated  by  instrumental 
variables?  The  reason  is  that  the  second  c  — >  e  is  a  complete  model  in 
itself:  causation  flows  from  cjto  e,  and  nothing  else  is  involved.  Oa 
the  other  hand,  p  =  Pq  +  w  hides  a  lot  of  complexity,  i.e.,  causation 
from  z  to  p  and  from  2  to  q.  These  causations  are  unidirectional 
z  -*p  and  z  --*  5  and  can  be  treated  like  c  — >  e;  but  p  «-»  g  is  not  of  the 
same  kind,  and  must  be  treated  by  taking  into  account  the  hidden 
part. 

7.4.  Connection  with  the  reduced  form 

The  instrumental  variable  technique  is  very  intimately  connected 
with  the  method  of  applying  least  squares  to  the  reduced  form. 
Assume  for  the  moment  that  $  is  the  only  exogenous  variable  affecting 

the  economy  and  that  the  complete  model  is 

I 

P  —  Pq         =  u 

1     L    1  (7-13) 

— P  t — *q  —  z  —  v 

whose  first  equation  is  manifest  (the  solid  arrows  in  Fig.  14).    The 
second  equation  is  latent  and  corresponds  to  the  broken  arrows. 

:  Exercise 

7. A  Explain  why  the  complete  model  cannot  contain  three  inde- 
pendent equations,  one  for  p  «;-»  q,  one  for  z  --*  p,  and  one  for  z  ••*  q. 

The  reduced  form  is 

I 

Dp  ==  0z  H u  +  @v 

\*  (7-14) 

Dq  =■    z tt+    v 

I  ?* 

i 

where  D  is  the  determinant  I/72  +  18/72. 

If  the  second  equation  in  (7-13)  contained  another  exogenous 
variable,  say,  2',  then  the  first  iequation  would  be  overidentified.  This 
fact  would  be  reflected  in  the'  instrumental  variable  technique  as  the 
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following  dilemma:  Should  we  use  z  or  z'  as  the  instrumental  variable? 
On  this  last  point  I  have  more  to  say  in  Sees.  7.5  to  7.7. 

7.5.  Properties  of  the  instrumental  variable  technique 
in  the  simplest  case 

The  technique  is  biased  and  consistent,  where  naive  least  squares 
is  biased  and  inconsistent. 

Bm*!s-fi  +  2z>mfi  + 1 _J22 — _ —  (7-15) 

rntq  m,q       r       Z)  ra„  -  (l/yi)mgu  +  mzv  v        ' 

^  g     ,     _1 ^iu  —    (1/Tl)^uu  +  Wvu         

D  mit  -  (2/yi)mtu  +  2mtv  +  (l/y\)muu  -  (2/7i)mrtt  +  mvv 

(7-16) 

Under  the  ordinary  Simplifying  Assumptions,  <rtu  =  <r„  =  ow  =  0; 
so,  for  large  samples,  the  first  expression  approaches  0,  and  the  second 
approaches 

Expression  (7-15)  gives  us  additional  guidance  for  selecting  an 
instrumental  variable.  To  minimize  bias,  the  following  conditions 
should  be  fulfilled,  either  singly  or  in  combination: 

1.  -matt  numerically  small 

2.  m„  numerically  large 

3.  D  numerically  large 

The  first  condition  says  that  z  should  be  truly  exogenous  to  the  sugar 
market.  Appropriations  for  school  lunches  are  better  in  this  respect 
than  the  tax  on  sugar,  because  the  tax  might  have  been  imposed  to 
discourage  consumption  or  to  maximize  revenue,  in  which  case  it 
would  have  some  connection  with  the  parameters  and  variables  of  the 
sugar  market. 

The  second  condition  says  that,  in  the  sample,  the  instrumental 
variable  must  have  varied  a  lot:  if  the  tax  on  sugar  varied  only  trivially 
it  had  no  opportunity  to  affect  p  and  q  significantly  enough  for  us  to 
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capture  /3  by  our  estimating  procedure.  From  this  point  of  view  the 
tax  on  sugar  might  be  less  desirable  as  an  instrumental  variable  than 
some  remote  but  more  volatile?  entity,  say,  the  budget's  appropriations 
for  U.S.  Information  Service,  j 

The  third  condition  says  that  D  =  l/?2  +  0/7 1  should  be  numeri- 
cally large;  that  is,  that  71  and  72  should  be  numerically  small  relative 
to  (3.  This  says  that,  to  minimize  bias,  p  and  q  should  react  more 
strongly  to  each  other  (in  the  manifest  model)  than  to  the  instrumental 
variable  in  the  latent  model,  j  It  requires  that  price  and  quantity  be 
more  sensitive  to  each  other  in  the  sugar  market  than  to  such  things 
as  the  U.S.I.S.  budget,  the  tax  on  honey,  or,  for  that  matter,  the  tax 
on  sugar  itself. 

It  is  not  easy  to  find  an  instrumental  variable  fulfilling  all 
conditions  at  once.     However,  if  the  sample  is  large,  any  sort  of. 
instrumental  variable  gives  better  estimates  than  least  squares. 

7.6.  Extensions 

1 

! 

The  instrumental  variable;  technique  can  be  extended  in  several 
directions:  I 

1.  The  single-equation  incomplete  manifest  model  may  con- 
tain several  parameters  to  be  estimated.  For  example,  the  model 
V  =  0i<Z  +  $iV  +  u  requires  two  instrumental  variables  z\  and  z*. 

The  estimating  formulas  are  analogous  to  (7-9) : 

S  =  m<'i''«>|iMrt        g   __  m(*..«t)(g.p)  (7-17) 

All  the  criteria  of  Sec.  7.5  fo::  selecting  a  good  instrumental  variable 
are  still  valid,  plus  the  following*  z\  and  z2  must  really  be  different 
variables,  that  is,  variables  not  well  correlated  in  the  sample;  else  the 
denominators  approach  zero,  and  the  estimates  &,  £2  blow  up. 

Exercise 

7.B  If  we  wish  to  estimate  the  parameters  of  p  —  fiq  +  yz  +  u, 
where  z  is  exogenous,  is  it  permissible  for  z  itself  to  be  one  of  the 
instrumental  variables  z\  and  22? 

2.  The  incomplete  manifest  model  may  consist  of  several  equations, 
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for  instance: 

021?  4"  ?  +  72«   =  W2 

Each  equation  can  be  estimated  independently  of  the  other,  using 
formulas  analogous  to  (7-17).  Variable  z  itself  and  another  variable  zx 
may  be  used  as  instrumental  variables  in  both  equations,  or  two 
variables  %\%  z2  completely  extraneous  to  the  manifest  model  may  be 
used. 

7.7.  How  to  select  instrumental  variables 

In  some  instances  we  may  have  several  candidates  for  the  role  of 
instrumental  variable.  The  choice  is  made  anew  for  each  equation 
of  the  manifest  model,  and  the  rules  are: 

1.  If  several  instrumental  variables  are  needed,  they  should  be 
those  least  correlated  with  one  another. 

2.  The  instrumental  variables  should  affect  strongly  as  many  as 
possible  of  the  variables  present  in  the  equation  that  is  being  estimated. 

Choosing  instrumental  variables  is  admittedly  arbitrary.  Another 
statistician  with  the  same  data  might  make  a  different  choice  and  so 
get  different  results  for  the  same  model.  The  technique  of  weighting 
instrumental  variables  eliminates  some  of  this  arbitrariness.  I  illustrate 
the  technique  for  the  single-equation,  single-parameter  demand  model 
p  =  @q  +  u.  Suppose  that  two  exogenous  variables  are  available, 
z i  (the  sugar  tax)  and  z<i  (the  tax  on  honey),  and  that  both  affect  p  and 
q.  To  select  z\  or  z2  is  arbitrary.  The  new  variable  z  =  Wizx  +  w2Z2, 
a  linear  combination  of  the  two  taxes  with  arbitrary  weights  W\  and  w2, 
is  less  arbitrary  because  both  taxes  are  taken  into  account.1 

Results  improve  considerably  if  we  take  W\}  w2  proportional  to  the 
importance  of  the  two  taxes  on  the  sugar  market.  Naturally,  to 
estimate  the  parameters  of  the  sugar  market,  the  weight  given  the 
sugar  tax  should  be  greater  than  that  given  to  the  tax  on  honey;  and 
vice  versa  when  we  want  to  study  the  honey  market.  In  general,  we 
ought  to  rank  the  instrumental  variable  candidates  zh  z2,  Zz,  *  .  .  in 
order  of  increasing  remoteness  from  the  sector  being  estimated  and 

1  This  treatment  with  wx  «  u>2  »  1  coincides  with  Theil's  method  with  k  m  1. 
Consult  Chap.  9  below. 
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assign  them  decreasing  weights  in  a  new  instrumental  variable 
z  =  w&i  +  w&2  +  w&z  +••'.••  The  more  accurate  the  a  priori 
information  by  means  of  whioh  weights  are  assigned,  the  more  does 
this  technique  approximate  5he  results  of  the  full  information  maxi- 
mum likelihood  method,  discussed  in  Chap.  8. 

Exercises 

Warning:  These  are  difficult! 

7.C  Prove  or  disprove  the  conjecture  that  weighted  instrumental 
variables  are  "better"  than  unweighted.  Use  the  model  p  =  Pq  -f  u, 
(l/yi)p  +  (I/72)?  —  to  —  $i;Z2  —  v,  where  yh  yit  6h  52  measure  the 
sensitivity  of  p  and  q  to  Z\  and  z2.  Define  W\  and  tt?2  as  6i/(5i  -f  52). 
Define 


m*tQ 

ftliwiZi+Wttj-q 


Prove  e{/Sf(*i)  -  Hfri)])1  >  «WW  -  ttfMll1  <  «WW  -  «(£<*«>] I f- 

7.D  Prove  or  disprove  the  conjecture  that  the  goodness  of  the 
weighted  instrumental  variably  technique  is  insensitive  to  small  depar- 
tures of  W\t  wt  ^om  their  ideal  relative  sizes  5i/(5i  +  62),  62/(61  +  52). 


CHAPTER  8 

Limited  information 


8tl«  Introduction 

Limited  information  maximum  likelihood  is  one  of  the  many  tech- 
niques available  for  estimating  an  identified  (exactly  or  overidentified) 
equation.  Other  methods  are  (1)  na'ive  least  squares,  (2)  least  squares 
applied  to  the  reduced  form,  (3)  instrumental  variables,  (4)  weighted 
instrumental  variables,  (5)  TheiFs  method,1  and  (6)  full  information. 

Method  1  is  biased  and  inconsistent;  the  rest  are  biased  and  con- 
sistent. They  are  listed  in  order  of  increasing  efficiency.  Limited 
information  leads  to  burdensome  computations,  but  is  less  cumbersome 
than  full  information.  Unlike  full  information  but  like  all  other 
methods,  limited  information  can  bo  used  on  one  equation  of  a  model 
at  a  time.  Limited  information  differs  from  the  method  of  instru- 
mental variables  in  two  ways:  it  makes  use  of  all,  not  an  arbitrary 
selection,  of  the  exogenous  variables  affecting  the  system;  it  prescribes 
a  special  way  of  using  the  exogenous  variables.  If  an  equation  is 
exactly  identified,  limited  information  and  instrumental  variables  are 
equivalent  methods.  Like  all  methods  of  estimating  parameters, 
limited  information  uses  formulas  that  are  nothing  more  than  a 

1  Discussed  in  Chap.  9. 
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glorified  version  of  the  quotient 

Change  in  effect  (s) 

Corresponding  change  in  cause(s) 

I  shall  illustrate  by  the  example  of  (8-1),  where  the  first  equation  ii  l@ 
be  estimated  by  the  limited  information  method.  The  rest  of  !h§ 
model  may  be  either  latent  or  manifest.  The  limited  information 
method  ignores  part  of  what  goes  on  in  the  remaining  equations  by 
deliberate  choice,  not  because  they  are  latent  (though,  of  course,  they 
might  be).  However,  in  (8-1)  the  entire  model  is  spelled  out  for 
pedagogic  reasons.  The  minus  signs  are  contrary  to  the  notations! 
conventions  used  so  far  but  ssre  very  handy  in  solving  for  ff\,  i/s»  y%» 
Nothing  in  the  logic  of  the  situation  is  changed  by  expressing  B  and  T 
in  negative  terms. 

2/i  -  ft/2  -  7*1  -  Wi 

2/2  -  0232/3  -  72^2  -  723^3  -  78*24   *   «2       (84) 

~ 0312/1  -f  1/3  -  731*1  -  734^4   «   U% 

As  usual,  a  single  asterisk  distinguishes  the  variables  admitted  in 
the  first  equation  and  a  double  asterisk  those  excluded  from  it.  Thus 
y*  =  vec  (2/1,2/2),  z*  =  vec  (zj,  y**  ■■  vec  (1/3),  z**  -  vec  (zi,fi|f«).. 

We  apply  the  limited  information  method  in  two  cases: 

1.  When  nothing  more  is  known  about  the  economy  than  that  some- 
how z2,  Z3,  Zi  affect  it 

2.  When  more  is  known  bub  this  is  purposely  ignored 

8.2.  The  chain  of  causation 

Let  the  first  equation  be  the>  one  we  wish  to  estimate,  out  of  a  model 
containing  several.  The  chains  of  causation  in  a  general  model  of 
several  equations  in  several  unknowns  are  shown  in  Fig.  15.  The  arrow- 
heads show  that  causation  flows  from  the  z's  to  the  y'a  but  not  baok> 
and  mutually  between  the  y's.  Solid  arrows  correspond  to  the  ?&$% 
equation,  broken  arrows  to  the  rest  of  the  model.  The  two  left-hand 
rounded  arrows,  one  solid,  one  broken,  show  that  the  y*'a  (the  endoge- 
nous variables  admitted  in  the  manifest  model)  interact  both  in  the 
first  equation  and  (possibly,  tDo)  in  the  rest  of  the  model.     The  right- 
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Fig.  15.  Causation  in  econometric  models. 


Fig.  16.  Chain  of  causation  in  the  special  model  (8-1). 


hand  rounded  arrow  shows  that  the  y**'s  (the  endogenous  variables 
excluded)  interact,  but,  naturally,  only  in  the  rest  of  the  model. 

Parenthetically,  the  crinkly  arrows  symbolize  intercorrelation  among 
the  exogenous  variables.  Ideally,  the  exogenous  variables  are  uncon- 
nected, but  in  any  given  sample  they  may  happen  to  be  intercorrelated. 
This  is  the  familiar  problem  of  multicollinearity  (Sec.  6.14)  in  its 
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general  form.  The  stronger  the;  correlation  between  one  z  and  another, 
the  less  reliable  are  estimates  of  the  7s  because  different  exogenous 
variables  have  acted  alike  in  ithe  sample  period.  We  shall  ignore 
multicollinearity  and  continue  with  the  main  subject. 

Figure  16  shows  the  chains  of  causation  in  (8-1).  The  variable  %% 
affects  ?/i  and  t/2  in  the  first  equation  and  y\  and  yz  in  the  third. 
Variables  z2,  23,  zk  affect  t/2  and  v/3  in  the  second,  and  z4  affects  7/1  and  ijt 
in  the  third.  We  can  make  the  arrows  between  the  y's  single  rather 
than  double-headed  because  there  are  as  many  equations  as  there  are 
endogenous  variables.  Thus  tlie  model  can  be  put  into  cyclical  form.1 
It  so  happens  that  (8-1)  is  already  in  this  form;  that  is,  given  a 
constellation  of  values  for  the  exogenous  variables  and  the  random 
disturbances,  if  we  give  yz  an  arbitrary  value,  then  y3  determines  y%^ 
which  in  turn  determines  yh  which  in  turn  affects  yz,  and  so  round  and 
round  until  mutually  compatible  values  are  reached. 

8.3.  The  rationale  of  limited  information 

1 
The  problem  of  estimating  thte  model  of  Fig.  16  can  be  likened  to  the 
following  problem.  Suppose  that  zh  z2,  zz,  zk  are  the  locations  of  four 
springs  of  water  and  that  ?/i,  y%,  1/3  are,  respectively,  the  kitchen  tap, 
bathtub  tap,  and  shower  tap  of  k  given  house.  The  arrows  are  pipes  or 
presumed  pipes.  Estimating  the  first  equation  is  like  trying  to  find 
the  width  of  the  pipes  between  z\  and  yi  and  y%  and  of  the  pipe  between 
2/2  and  y\.  The  width  is  estimated  by  varying  the  flows  at  the  four 
springs  z\,  z2,  23,  zk  and  then  measuring  the  resulting  flow  in  the  kitchen 
(2/1),  bathtub  (y2),  and  shower  (2/3).  Limited  information  attempt!  to 
solve  the  same  problem  with  the  following  handicaps  arising  either  from 
lack  of  knowledge  or  from  deliberate  neglect  of  knowledge: 

1.  Pipes  are  known  to  exist  for  certain  only  where  there  &r©  £olid 
arrows  (7,  j3). 

2.  It  is  known  that  z2,  Zz,  z*  enter  the  flow  somewhere  or  other?  but 
it  is  not  known  where. 

3.  It  is  not  known  whether  there  is  another  direct  pipeline  (*y«) 
from  z\  to  the  kitchen  (2/1)  and  bathtub  (?/2). 

4.  The  flow  at  the  shower  (2/3)  is  ignored  even  if  it  is  measurable. 
1  Note  carefully  that  the  cyclical  and  the  recursive  are  different  forms. 
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So  as  not  to  fill  up  page  upon  page  with  dull  arithmetic,  I  am  going 
to  cut  model  (8-1)  drastically  by  some  special  assumptions,  which,  I 
vouch,  remove  nothing  essential  from  the  problem.  The  special 
assumptions  are:  £  =  l.G,  7  =  0.5,  731  —  734  =  0,  1831  =  0.1,  /323  =  2; 
72222  -f  723Z3  +  72424  is  combined  into  one  simple  term  =  722**; 
72  =  0.5.    Then  (8-1)  collapses  to 

t/i  --  Qy3  -  yzX  =  Mi 

2/2  -  £232/3  -  722**  =  u2  (8-2) 

-  0312/1  +2/3  =  M3 

and  Fig.  16  collapses  to  Fig.  17.     Now  let  us  change  metaphors. 


Fig.  17.  Another  special  case. 

Instead  of  a  hydraulic  system,  think  of  a  telephone  network.  The 
coefficients  0,  7,  if  greater  than  1,  represent  loud-speakers;  if  less  than 
1 ,  low-speakers.  Where  a  coefficient  is  equal  to  1,  sound  is  transmitted 
exactly.  To  avoid  having  to  reconcile  conflicting  observations,  assume 
that  all  the  disturbances  are  zero,  i.e.,  that  there  is  neither  leakage  of 
sound  out  of  nor  noise  into  the  acoustic  system  of  Fig.  17. 

Here  is  how  the  estimating  procedure  works.  Begin  from  a  state 
of  acoustical  equilibrium,  and  measure  the  noise  level  at  each  point  of 
the  network.  Then  step  up  the  sound  level  at  z**  by  100  units. 
Only  50  of  these  reach  location  2/2,  because  there  is  a  twofold  low-speaker 
(72  =  0.5)  between  z**  and  2/2.  Also  step  up  the  sound  level  at  z*  by, 
say,  10  units.     Only  5  units  (7  =  0.5)  get  to  2/1.     But,  whatever  extra 
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noise  there  is  at  y\,  one-tenth  bi  it  (08i  =  0.1)  reaches  y%.  From  y%  a 
loud-speaker  doubles  the  increment  as  it  conveys  it  to  y%,  whence  some 
gets  to  Vh  and  so  on.  By  differencing  (8-2)  and  solving  for  Aw  ■  0, 
Az*  =  10,  and  Az**  =  100,  the  ultimate  increments  are  found  to  be 
A?/i  =  125,  At/2  =  75,  Ayz  —  12.5.  Now,  suppose  we  did  not  know  how 
strong  was  the  low-speaker  connection  /3  between  t/i  and  y*  By 
differencing  (8-2),  we  get  j 

* 'Ayi-  7  As*  .      . 

When  the  model  is  exact,  it  takes  exactly  five  observations  to  determine 
0,  7,  72,  fe,  03i.  When  the  model  is  stochastic,  there  are  complica- 
tions, but  the  basic  appearance  of  the  formula  is  not  much  different. 
The  numerator  can  be  interpreted  as  that  change  in  the  sound  level  y\ 
not  attributable  to  what  is  coming  over  the  line  from  z*,  that  is  to  say, 
only  the  sound  that  comes  from  y2  and  ?/3.  The  denominator  measures 
the  increment  at  y-i  resulting  from  two  sources  z**  and  y*.  The  limited 
information  method  just  ignores  the  latter  source  entirely.  This  is  so 
because  both  /323  and  03i  belong  to  the  "rest  of  the  model"  and  are 
neither  specified  nor  evaluated.     So,  (8-3)  is  interpreted  as  follows: 

_  variation  ;.n  y*  not  due  to  any  z*  ,R  .. 

variation  in  y\  from  all  sources 

The  limited  information  method  suppresses  /?23  and  /33i  and  estimates 
0by 

__  A?/i  —  7  Az*  _  variation  in  y\  not  due  to  z*  ,~  -. 

72  Az**       ~     variation  in  y*  due  to  z** 

Notice  carefully  that  the  method  suppresses  only  023,  03i,  that  is,  the 
latent  model's  intervariation  of  the  endogenous  variables.  It  does  not 
suppress  72,  i.e.,  the  variation  (in  the  latent  model)  due  to  the  exogenous 
variables  z**.  I 


8.4.  Formulas  for  limited  information 

This  section  shows  that  the  lengthy  formulas  for  computing  limited 
information  estimates  are  just  fancy  versions  of  (8-3).  It  can  safely 
be  skipped,  for  it  contains  no  new  ideas.     To  obtain  estimates  of  the 
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#s  of  the  first  equation,  combine  the  moments  of  the  variables  as  in 
the  following  list: 

In  general  In  model  (8-1) 

1.  Construct 

C  m  my««(nO-1m1!y.  *n(vLV,Kn,$t .*..««)  '  {n*(«i,«,.t,.i4)2)-1 

2.  Construct 

D  -  myvtmBV}-,mav  ™,,,,,    •  (m,,,,)"1  •  (mflVl    mtlVt) 

3.  Construct 

W     SS     **ly»y«     ~     C 

4.  Compute1 

V  =  < 

5.  Compute2 

Q  _  V-JW 

6.  The  estimate  of  the  /9s  of  the  first  equation  Is  a  nontrivial  solution 
(called  the  eigenvector)  of  Q. 

7.  Having  computed  (J,  one  can  calculate  f ,  u,  and  estimates  of  the 
covariances  of  the  disturbances  and  the  parameter  estimates. 

In  steps  1  and  2  above,  the  factors  mz«z«  and  mza5  and  also  the  z  and 
z*  in  the  remaining  moment  matrices  play  a  role  analogous  to  the 
weights  cj/2c?  in  the  least  squares  technique.8  They  just  provide  a 
method  for  reconciling  the  conflicting  observations  generated  by  the 
nonzero  random  disturbances. 

The  matrix  my.y.  corresponds  to  the  pair  of  round  arrows  about  y* 
in  Fig.  15. 

Essentially,  Q  is  an  estimate  of  the  0s  of  the  first  equation.  Q  can  be 
interpreted  as  a  quotient,  because  the  matrix  operation  V-1W  reminds 
one  of  the  ratio  of  two  numbers:  W/V.  Actually,  this  impressionistic 
intuition  is  quite  correct.     W  corresponds  to  an  elaborate  case  of  Ae 

1  Klein  calls  this  B  instead  of  V.  I  use  V  to  avoid  confusion  with  the  B  of 
By  +  Tz  —  u. 

2  Klein  calls  this  A.  I  use  Q  to  avoid  confusion  with  the  A  of  the  model 
Ax  «  u. 

3  Compare  with  Sec.  7.2. 
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(the  change  in  the  effects),  and  V  is  an  elaborate  case  of  Ac  (the  cor- 
responding change  in  the  causes).  Indeed  W  and  V  are  complicated 
cases  of  the  numerator  and  denominator  of  (8-5).  W  is  interpreted  as 
the  variation  of  the  endogenous  variables  not  due  to  any  exogenous 
changes,  and  V  expresses  the  variation  of  the  endogenous  variables 
from  all  sources  exogenous  to  any  equation  of  the  model  and  endoge- 
nous to  the  manifest  part. 

8.5.  Connection  with  the  instrumental  variable  method 

Limited  information  recognizes  that  exogenous  influences  not  present 
in  the  first  equation  influence  the  course  of  events.  The  instrumental 
variable  method  acknowledges  the  same  thing.  Limited  information 
makes  use  of  all  these  exogenous  influences,  whereas  the  instrumental 
variable  method  (generally)  picks  from  among  them,  either  hap- 
hazardly or  according  to  the  principles  of  Sec.  7.7. 

When  the  first  equation  is  exactly  identified,  picking  is  impossible 
and  the  two  methods  coincide. 

8.6.  Connection  with  indirect  least  squares 

The  limited  information  method  can  also  be  interpreted  as  a  form  of 
modified  indirect  least  squares  or  as  a  generalization  of  directional 
least  squares  (see  the  Digression  in  Sec.  4.4).  The  direct  or  naive 
least  squares  method  estimates  /?  essentially  as  the  regression  coefficient 
of  y\  on  2/2.  Haavelmo's  proposition  (Chap.  4)  advised  us  to  minimize 
square  residuals  in  the  northeast-southwest  direction  in  order  to  allow 
for  autonomous  variations  in  the  exogenous  variable,  investment  zt. 
In  (8-1)  there  are  several  such  exogenous  variables  zh  z2,  «3,  za  which 
generate  in  the  2/12/2  plane  a  scatter  diagram  which  is  a  weighted 
average  of  lozenge -shaped  figures  (as  in  Fig.  9),  one  for  zi,  one  for  zit 
and  so  on.  In  matrix  C  (and,  hence,  in  W  and  V)  this  weighted 
averaging  has  taken  place. 

Further  readings 

Hood,  chap.  10,  describes  in  detail  how  to  compute  limited  information 
and  other  types  of  estimates,  and  illustrates  with  a  completely  worked  out 
macroeconomic  model  of  Klein's. 


CHAPTER  9 


The  family  of  simultaneous 
estimating  techniques 


9.1.  Introduction 

We  owe  to  Theil1  a  theorem  showing  that  all  the  estimating  tech- 
niques of  Chaps.  4  to  8  are  special  cases  of  a  new  technique,  which  has 
the  further  merit  of  being  fairly  easy  to  compute.  Section  9.2,  which 
covers  this  ground,  is  addressed  primarily  to  lovers  of  mathematical 
generality  and  elegance;  other  readers  might  skip  or  skim. 

The  other  sections  of  this  chapter  reconsider  underidentification  and 
overidentification  from  the  point  of  view  of  research  strategy.  Section 
9.3  accepts  models  as  given  (over-,  under-,  or  exactly  identified)  and 
suggests  alternative  treatments.  Section  9.4  raises  the  issue  of 
whether  econometric  models  can  be  anything  but  underidentified. 

9*2.  Theil's  method  of  dual  reduced  forms 

This  method  can  be  applied  to  all  equations  of  a  system,  one  at  a 
time.     The  equation  we  want  to  estimate,  called  the  "first"  equation, 

1  Reference  in  Further  Readings  at  the  end  of  this  chapter. 
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comes  from  a  complete  system,  for  instance,  (8-1).  We  know  and  can 
observe  all  the  exogenous  variables  affecting  the  system,  and  we  also 
know  a  priori  which  variables  (endogenous  and  exogenous)  enter  the 
first  equation.  The  other  equations  may  be  identified  or  not.  The 
disturbances  have  the  usual  Simplifying  Properties.  Any  endogenous 
variable  of  the  first  equation  can  be  chosen  to  play  the  role  of  dependent 
variable.  We  shall  use  2/1  in  this  role.  The  remaining  variables  of  the 
first  equation,  namely,  2/2,  .  .  .  ,  ya*\  Z\%  .  .  .  ,  z//»,  must  all  be  differ- 
ent in  the  sample;  that  is  to  say,  they  must  not  behave  as  if  they  were 
linear  combinations  of  one  another.  We  do  not  need  to  know  or 
observe  the  endogenous  variables  2/g*+i,  .  .  .  ,  ya  not  present  in  the 
first  equation. 

Let  one  star,  as  usual,  represent  presence  in  the  first  equation,  and 
two  stars,  absence  from  the  first  equation. 

We  then  form  two  reduced  forms  whose  coefficients  we  calculate  by 
simple  least  squares:  (1)  y*  on  z*  with  parameters  f  and  residuals  v; 
and  (2)  y*  on  z  =  (z*,z**)  with  parameters  p  and  residuals  w.  For 
instance,  to  estimate  the  first  equation  of  (8-1),  compute 


2/i  =  tfnZi  +  h        2/1  -  Pn*i  +  P12Z2  +  P13Z3  +  P14Z4  +  Wi 
2/2  =  #2iZi  -f  v%        2/2  =  P21Z1  +  P22Z2  4-  P23Z3  +  P24Z4  +  Wt 


(9-1) 


The  right-hand  set  in  (9-1)  is  necessary  for  estimating  the  first  and 
useful  for  estimating  the  other  equations  of  (8-1).  Let  us  omit  the 
bird  (v)  where  it  is  obvious. 

Next,  we  compute  the  moments  of  the  residuals  on  one  another  and 
construct  two  new  matrices  D(fc)  and  N(&): 

U\k)  sb  m.(v ,vo*;*» 'b*)(vv  ...V0s*i «aO 

;  —  fcn\(W \vdq+, 0, . . . ,  0)(wj wo*;  c. . » , .  o.» 

JM^/cy  =  m^, vo*\'\ *a«)*i/j  j     A?lIi(W| wq*\  0 0)'U>! 

where  A;  is  a  variable  that  will  be  defined  below.  Then  the  estimates 
of  the  j8s  and  7s  of  the  first  equation  are  given  by 

est  fa  .  .  .  ,/3G*,7i|  .  .  .  ,yH*)  -  [D(*)]-W(*)  (9-2) 

Theil  has  proved  that,  if  Jc  =  0,  then  (9-2)  gives  the  naive  least 
squares  estimate  with  2/1  treated  as  the  sole  dependent  variable.  If 
h  *■  1,  (9-2)  gives  the  method  of  unweighted  instrumental  variables 
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of  Sec.  7.7.    If  k  =  1  -f-  v,  where  v  is  the  smallest  root  of 

del  [m(t, »o»)(»i....  »<?•)  —  (1  +  ^tao^i wo*)(u>i w0*)l  =  0     (9-3) 

then  the  estimates  of  (9-2)  are  identical  with  the  limited  information 
estimates  of  Chap.  8.  All  these  estimates  except  for  the  k  =  0  case 
are  consistent,  but  biased  in  the  /3s.  In  the  case  k  —  1,  the  bias  itself 
can  be  estimated  and  corrected  for.1 

These  findings  not  only  are  exciting  for  their  beauty  and  symmetry, 
but  are  practical  as  well.  The  regressions  (9-1)  are  straightforward 
and  attainable  by  simple  calculation  (see  Appendix  B)  even  for  large 
systems.  The  solution  of  (9-3)  is  not  too  hard,  since  the  number  G* 
of  present  endogenous  variables  seldom  exceeds  3  or  4  in  any  actual 
models.  But  (9-3)  must  be  calculated  over  again  if  we  decide  to 
estimate  the  second  or  third  equation  of  the  original  model.  Theil 
states  that  his  technique  works  if  the  remaining  equations  of  the 
system  are  nonlinear  and  that  it  works  for  large  samples  even  when 
some  of  the  z's  are  lagged  values  of  some  y. 

9.3.  Treatment  of  models  that  are  not  exactly  identified 

This  section  gives  advice  on  how  to  treat  models  that  in  their 
natural  state  contain  some  underidentified  or  some  overidentified 
equations,  or  both.  The  alternatives  are  listed  from  the  most  desirable 
to  the  least  desirable,  disregarding  the  cost  of  computation. 

If  a  model  contains  some  underidentified  equations,  we  need  do 
nothing  about  them  unless  we  wish  to  estimate  them.  The  remaining 
equations,  if  identified,  can  be  estimated  in  any  case. 

If  we  wish  to  estimate  the  underidentified  equation,  we  must  make 
certain  alterations: 

1.  Make  it  identified  by  bringing  in  parameter  estimates  from 
independent  sources,  say,  cross-section  data.  There  are  pitfalls  of  a 
new  kind  in  this  method,  however,  which  are  noted  briefly  in  Chap.  12. 

2.  Identify  the  equation  in  question  by  strategically  adding  variables 
elsewhere  in  the  model.  This  process,  however,  might  de-identify  the 
rest  of  the  model. 

3.  Go  ahead  and  estimate  the  underidentified  equation;  then,  if  you 
have  a  priori  information  on  covariances,  perform  the  tests  of  Sec.  G.9 

1  Compare  with  Appendix  B. 
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to  detect  (or  try  to  detect)  whether  you  have  estimated  a  bogus 

i 

i 


function. 


If,  on  the  other  hand,  the  model  contains  some  overidentified 
equations: 

1.  Use  the  full  information,  maximum  likelihood  method.  This  will 
yield  consistent  and  efficient  estimates  of  the  identifiable  parameters. 

2.  Use  the  limited  information,  maximum  likelihood  method, 

3.  Use  instrumental  variables,  weighted. 

4.  Use  instrumental  variables,  unweighted. 

5.  In  the  given  equations,  add  variables  where  they  are  most  relevant 
in  such  a  way  as  to  remove  the  overidentification. 

6.  Enlarge  the  system  by  endogenizing  a  previously  exogenous 
variable. 

7.  In  the  original  overidentified  model,  remove  the  overidentification 
by  introducing  redundant  variables  in  the  other  equations.  If  it 
turns  out  that  the  redundant  variable  has  a  significant  parameter,  you 
have  succeeded. 

8.  Drop  variables  to  remove  the  overidentification.  Instead  of 
outright  dropping,  you  may  linearly  combine  two  or  more  such 
variables.  This  cannot  always  be  done,  because  the  combined 
variables  are  not  always  present  together  or  absent  together  elsewhere 
in  the  model. 

9.  Use  the  reduced  form,  and  select  arbitrarily  one  of  the  several 
sets  of  alternative  estimates. 

Underidentification  is  a  more  serious  handicap  than  overidentifica- 
tion. To  remove  the  former  you  have  to  make  material  alteration!  in 
the  model.  To  remove  the  latter  you  can  always  use  the  full  informa- 
tion method. 

Whatever  the  final  alterations,  I  would  begin  by  constructing  my 
models  without  worrying  about  identification.  In  doing  so,  I  sm  sure 
that  I  am  acting  in  the  light  of  my  best  a  priori  wisdom,  givcm  the 
objectives  of  my  study  and  my  computing  budget.  If  it  turns  out 
that  identification  makes  alterations  necessary,  I  think  that  honesty 
requires  me  to  keep  a  record;  of  the  identifying  alterations.  Like 
Ariadne's  thread,  this  record  keeps  track  of  my  search  for  a  second 
best;  I  may  want  to  give  up  in  frustration  and  return  to  try  another 
way  out  of  the  Minotaur's  chamber. 
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9.4.  The  "natural  state"  of  an  econometric  model 

Econometricians  have  devoted  a  good  deal  of  attention  to  over- 
identified  models.  This  entire  book,  from  Chap.  6  on,  is  devoted  to 
developing  various  approximations1  to  the  full  information  method, 
which  everybody  tries  to  avoid  because  of  its  burdensome  arithmetic. 
According  to  Liu,2  we  have  been  wasting  our  effort,  because  all  well- 
conceived  econometric  models  are  in  truth  necessarily  underidentified: 

In  economic  reality,  there  are  so  many  variables  which  have  an  important 
influence  on  the  dependent  variable  in  any  structural  equation  that  all 
structural  relationships  are  likely  to  be  "underidentified." 

So  Liu  would  not  use  any  of  our  elaborate  techniques,  but  would 
estimate  just  the  reduced  form  and  do  so  by  simple  least  squares. 
The  reduced  form  is  to  include  as  many  exogenous  variables  as  our 
knowledge  and  computational  patience  permit.  Liu  would  then  use 
these  estimates  for  forecasting,  and  claims  that  they  forecast  better 
than  all  other  techniques. 

These  subversive  ideas  deserve  careful  consideration.  Is  it  true 
that  structural  equations  in  their  natural,  unemasculated,  noble- 
savage  state  are  underidentified?  If  they  are,  in  what  sense  are 
forecasts  from  the  reduced  form  better? 

To  begin  with,  there  are  occasions  in  which  the  investigator  does 
not  care  to  know  the;  values  of  the  structural  parameters  and  is  content 
with  some  kind  of  reduced  form.  To  illustrate  one  occasion  of  this 
sort,  assume  that  the  investigator 

1.  Works  from  a  typical  and  large  enough  sample 

2.  Forecasts  for  an  economy  of  fixed  structure 

3.  Forecasts  from  exogenous  variables  that  stay  in  their  sample 
ranges 

Under  the  above  conditions,  an  investigator  would  be  glad  to  work 
with  a  ready-made  reduced  form  though  not  necessarily  with  parame- 
ters estimated  by  simple  least  squares.  He  would  accept  the  latter 
if  justifiable,  not  for  want  of  anything  better. 

1  Unweighted  and  weighted  instrumental  variables  and  limited  information. 
*  Ta-Chung  Liu,  "A  Simple  Forecasting  Model  for  the  U.S.  Economy,"  p.  437 
(International  Monetary  Fund  Staff  Papers,  pp.  434-466,  August,  1955). 
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Are  econometric  models  necessarily  underidentified?  Admittedly, 
it  is  an  oversimplification,  as  Liu  states,1  to  impose  the  condition  that 
certain  variables  be  absent  from  a  given  structural  equation.  But  it  is 
gross  "overcomplification" — to  coin  a  much-needed  word—to  impose 
no  condition  at  all,  inviting  into  the  demand  for  left-handed,  square- 
headed  J/^-inch  bolts  (and  on  equal  a  priori  standing  with  the  price  of 
steel)  the  average  diameter  of  tallow  candles  and  the  failure  or  success 
of  the  cod  catch  off  the  banks  of  Newfoundland.  My  instinct  advises 
me  to  go  halfway  concerning  these  new  variables:  neither  leave  them 
out  altogether  nor  admit  them  as  equals.     Consider  the  model 

q  +  ap +  yr  +  5/=  u 
$q+    p  =v  C?M) 

consisting  of  one  underidentified  and  one  overidentificd  equation. 
Now,  if  r  and  /  are  admitted  as  equals  in  the  second  equation,  with 
parameters  of  their  own,  the  whole  system  becomes  underidentified. 
But  the  very  knowledge  that  first  convinced  us  to  leave  them  out  of  the 
second  equation  now  advises  us  to  tack  them  on  with  a  priori  small 
parameters,  small  relative  to  0,  ?,  etc.  A  reasonable  restatement 
might  be  the  following: 

q  +  ap  +  yr  +    8f  -  u 
(3q+    p+j(3r  +  k8f~v  v*"°' 

where  j,  h  are  small  constant:?,  say  Kooo>  Koo>  or  some  other  not 
unreasonable  value.  And  now  (wonder  of  wonders!)  both  equations 
have  become  identified.  The  trick  does  not  always  work.  For 
instance,  it  does  not  help  in 

pq  -r   p  =  w2 

to  fill  the  hole  with  kar,  nor  k(3r,  nor  kyr  because  we  still  have  three 
parameters  (a,  0,  7)  to  estimate  and  the  reduced  form  contains  only 
two  coefficients  7F1  =  mqr/mrr,  ih  —  mpr/mrr.  However,  if  the  supply 
of  exogenous  variables  is  less  niggardly  than  in  (9-6)  it  is  not  hard  to 
find  reasonable  ways  to  complete  a  model  so  as  to  identify  it  in  it3 
entirety,  if  we  so  desire. 

The  most  difficult  and  dangerous  step  is  the  assigning  of  values  to 

1  Ibid.,  p.  405.  ; 


132  THE  FAMILY  OP  SIMULTANEOUS   ESTIMATING   TECHNIQUES 

j  and  k.  The  values  must  have  the  correct  algebraic  sign;  otherwise, 
structural  parameters  are  wildly  misestimated.  If  the  correct  magni- 
tudes forj  and  k  are  unknown,  it  is  better  to  err  on  the  small  side  than 
on  the  large.  Too  small  (positive  or  negative)  a  value  of  j  is  better 
than  a  hole  in  the  equation,  but  too  large  a  value  may  be  worse  than  a 
hole. 

9.5.  What  are  good  forecasts? 

If  we  want  to  forecast  from  an  underidentified  model,  we  have  no 
choice  but  to  use  some  kind  of  reduced  form;  from  an  overidentified 
model,  it  is  convenient,  not  compulsory,  to  work  from  a  reduced  form. 
The  entire  question  in  both  cases  is:  What  sort  of  reduced  form? 
How  ought  we  to  compute  its  coefficients? 

To  pin  down  our  ideas,  we  shall  consider  the  model  By  -f  Tz  =  u, 
where  u  has  all  the  Simplifying  Properties;  in  addition  we  shall  make 
the  covariances  aUgUhi  known  fixed  constants,  possibly  all  equal,  so  as  to 
keep  them  out  of  the  way  of  the  likelihood  function.  This  way  we 
concentrate  attention  on  the  structural  parameters  /?,  7,  and  x  and  their 
rival  estimates.  The  reduced  form  is  y  **  riz  +  v,  where  n  =  —  B_1r, 
v  =  B_1u.  The  reduced  form  contains  the  entire  set  of  exogenous 
variables  whether  the  original  form  is  exactly,  over-,  or  underidentified. 

Maximum  likelihood  minimizes 


22* 


by  the  jSs  and  7s;  limited  information  and  instrumental  variables 
approximate  this.  The  naive  reduced  form  advocated  by  Liu 
minimizes 

22* 

<      0 

by  the  ts  (whatever  these  may  be).  Naturally,  the  two  procedures 
are  not  equivalent,  and,  naturally,  the  second  guarantees  that  residuals 
will  be  forecast  with  minimum  variance.1  But  what  is  so  good  about 
forecasting  residuals  with  minimum  variance?    The  forecasts  themselves 

1  Provided  the  sample  and  structure  conform  to  conditions  1  to  3  of  Sec.  9.4. 
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in  both  cases  are  (in  general)  biased,  but  the  forecasts  by  maximum 
likelihood  have  the  greater  probability  of  being  right. 

In  Fig.  18,  p  is  the  course  of  future  events  if  no  disturbances  occur. 
The  curve  labeled  p  shows  the  (biased)  probability  distribution  of  the 
full  information,  maximum  likelihood  estimate  of  p;  it  is  in  general 
biased  (ep  s*  p)  but  has  its  puak  at  p  itself.  Curve  p  m  another 
maximum  likelihood  estimate  (.say,  instrumental  variables  or  limited 
information) ;  it  too  has  a  peak  at  p  but  a  lower  one,  perhaps  a  different 
bias  £p,  and  certainly  a  larger  variance  than  p.  The  reduced-form 
least  squares  estimate  is  distributed  as  in  curve  p;  naturally  it  has  a 


Ep  Ep 


Variat  le  and  its  forecasts 
Fig.  18.  The  properties  of  forecasts,    p:  the  true  value  of  the  forecast  variable 
under   zero    disturbances,    p:   reduced-form    least  squares  estimates,    p:  full- 
information    maximum    likelihood    estimates,     p:   other   maximum   likelihood 

estimates.  ; 


smaller  spread  than  p  and  p;  it  may  be  more  or  less  biased  than  either; 
but  its  peak  is  off  p. 

To  put  this  into  words:  If,  in  the  postsample  year,  all  disturbances 
happen  to  be  zero,  maximum  likelihood  estimates  forecast  perfectly, 
and  least  squares  forecast  impc  rfectly.  If  the  disturbances  are  non- 
zero, both  forecast  imperfectly;  but,  on  the  average  and  in  the  long 
run,  least  squares  forecasts  are  less  dispersed  around  their  (biased) 
mean. 

Which  criterion  is  more  reasonable  is,  I  think,  open  to  debate.  I 
favor  maximum  likelihood  estimates  for  much  the  same  reason  that  I 
accept  the  maximum  likelihood  criterion  in  the  first  place:  If  we  arc  to 
predict  the  future  course  of  events,  why  not  predict  that  the  most 
probable   thing    (u  =  0)    will    happen?     What  else  can   we  sanely 
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assume — the  second  most  probable?  On  the  other  hand,  if  my  job 
depends  on  the  average  success  of  my  forecasts,  I  shall  choose  the 
least  biased  technique  and  disregard  the  highest  probability  of  particu- 
lar instances.  If  I  want  to  make  a  showing  of  unswerving,  unvacil- 
lating  steadfastness,  I  shall  use  the  least  squares  technique  on  the 
reduced  form,  even  though  it  steadfastly  throws  my  forecasts  off  the 
mark  in  each  particular  instance  and  in  the  totality  of  instances. 

Further  readings 

The  reference  for  Sec.  9.2  is  H.  Theil,  "Estimation  of  Parameters  of  Econo- 
metric Models"  {Bulletin  de  Vinslitut  international  de  statistique,  vol.  34,  pt.  2, 
pp.  122-120,  1954).    It  is  full  of  misprints. 

Extraneous  estimators  are  illustrated  in  Klein,  chap.  5,  where  ho  pools 
time-series  and  cross-section  data.  Their  statistical  and  common-sense  diffi- 
culties are  discussed  in  Edwin  Kuh  and  John  R.  Meyer,  "How  Extraneous 
Are  Extraneous  Estimates?"  (Review  of  Economics  and  Statistics,  vol.  39, 
no.  4,  pp.  380-393,  November,  1957). 

Tinbergen,  pp.  200-204,  discusses  the  advantages  and  disadvantages  of 
working  from  a  reduced  form,  but  overlooks  that  its  least  squares  estimation 
is  maximum  likelihood  only  for  an  underidentified  or  exactly  identified 
system. 

Ever  since  Haavelmo,  Koopmans,  and  others  proposed  elaborate  methods 
for  correct  simultaneous  estimation,  naive  and  not-so-naive  least  squares  has 
not  lacked  ardent  defenders.  Carl  F.  Christ,  "Aggregate  Econometric 
Models"  [American  Economic  Review,  vol.  46,  no.  3,  pp.  385-408  (especially  in 
pp.  397-401),  June,  19.56],  claims  that  least  squares  forecasts  are  likely  to  be 
more  clustered  than  other  forecasts;  and  Karl  A.  Fox,  "Econometric  Models 
of  the  U.S.  Economy"  (Journal  of  Political  Economy,  vol.  64,  no.  2,  pp.  128- 
142,  April,  195G),  has  performed  simple  least  squares  regressions  using  the 
data  and  form  of  the  Klcin-Goldberger  model  (for  reference,  see  Further 
Readings,  chap.  1).  See  also  Carl  F.  Christ,  "A  Test  of  an  Econometric 
Model  of  the  United  States  1921-1947"  (Universities-National  Bureau  Com- 
mittee, Conference  on  Business  Cycles,  New  York,  pp.  35-107,  1951),  with 
comments  by  Milton  Friedman,  Lawrence  R.  Klein,  Geoffrey  H.  Moore,  and 
Jan  Tinbergen  and  a  reply  by  Christ,  pp.  107-129.  In  pp.  45-50  Christ 
summarizes  the  properties  of  rival  estimating  procedures.  E.  G.  Bennion, 
in  "The  Cowles  Commission's  'Simultaneous  Equations  Approach':  A  Sim- 
plified Explanation"  (Review  of  Economics  and  Statistics,  vol.  34,  no.  1,  pp. 
49-56,  1952),  illustrates  why  least  squares  gives  a  better  historical  relation- 
ship and  better  forecasts  (as  long  as  exogenous  variables  stay  in  their 
historical  range)  than  do  simultaneous  estimates.  John  R.  Meyer  and  Henry 
Laurence    Miller,    Jr.,   "Some   Comments  on   the   'Simultaneous-equation 
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Approach* "  (Review  of  Economics  and  Statistics,  vol.  36,  no.  1,  February, 
1954),  state  very  clearly  the  different  kinds  of  situations  in  which  forecasts 
have  to  be  made — and  to  each  corresponds  a  proper  estimating  procedure. 

Herman  Wold  says  that  he  wrote  Demand  Analysis  (New  York:  John  Wiley 
&  Sons,  Inc.,  1953)  in  large  part  to  reinstate  "a  good  many  methods  which 
have  sometimes  been  declared  obsolete,  like  the  least  squares  regression  or 
the  short-cut  of  consumer  units  in  the  analysis  of  family  budget  data"  and 
to  "reveal  and  take  advantage  of  the  wealth  of  experience  and  common  sense 
that  is  embodied  in  the  familiar  procedures  of  the  traditional  methods"  (from 
page  x  of  the  preface).  He  believes  that  the  economy  is  in  truth  recursive 
and  that  it  can  be  described  by  recursive  models  whose  equations,  in  the 
proper  sequence,  can  be  estimated  by  least  squares.  His  second  chapter, 
entitled  "Least  Squares  under  Delate"  (especially  sees.  7  to  9),  is  very  far 
from  convincing  me  that  ho  is  right. 


CHAPTER   10 


Searching  for  hypotheses 
and  testing  them 


10.1#  Introduction 

Crudely  stated,  the  subject  of  this  chapter  is  how  to  tell  whether 
some  variables  of  a  given  set  vary  together  or  not  and  which  ones  do  so 
more  than  others.  The  problem  is  how  to  make  three  interrelated 
choices:  (1)  a  choice  among  the  variables  available,  (2)  a  choice  among 
the  different  ways  they  can  vary  together,  and  (3)  a  choice  among 
different  criteria  for  measuring  the  togetherness  of  their  variation. 
The  whole  thing  is  like  a  complicated  referendum  for  simultaneously 
(1)  choosing  the  number  and  identity  of  the  delegates,  (2)  deciding 
whether  they  should  sit  in  a  unicameral  or  multicameral  legislature, 
and  (3)  supplying  them  with  rules  of  procedure  to  use  when  they  go 
into  session. 

This  triple  task  is  too  much  for  a  statistician,  as  it  is  for  a  citizenry: 
it  wastes  statistical  data,  as  it  wastes  voters'  time  and  attention. 
Just  as,  in  practice,  people  settle  independently,  arbitrarily,  and  at  a 
prior  stage  the  number  of  chambers,  the  number  of  delegates,  and  the 
rules  of  procedure,  so  the  statistician  uses  maintained  hypotheses. 
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For  example,  in  the  model  C<  =  ja  +  yZt  +  w«  of  Chap.  1,  the  presence 
of  one  and  not  two  equations,!  two  and  not  four  variables,  all  the 
remaining  stochastic  and  structural  assumptions,  and  the  requirement 
for  maximizing  likelihood  are  the  maintained  hypotheses.  Only  rival 
hypotheses  about  the  true  parameter  values  a  and  7  remain  to  be 
tested.  The  entire  field  of  hypothesis  searching  and  testing  consists  of 
variations  on  the  above  theme.'  The  maintained  hypotheses  can  be 
made  more  or  less  liberal,  or  they  may  change  roles  with  the  questioned 
hypotheses.    Section  10.4  lists  many  specific  examples. 

The  general  moral  of  this  chapter  is  this:  Having  used  your  data  to 
accept  or  reject  a  hypothesis  while  maintaining  others,  you  are  not 
supposed  to  turn  around,  maintain  your  first  decision,  and  test  another 
hypothesis  with  the  same  data.i  If  you  are  interested  in  testing  two 
hypotheses  from  the  same  set  c'f  data,  you  must  test  them  together. 
Thus,  if  you  want  to  find  both  the  form  and  personnel  of  government 
preferred  by  the  French,  you  should  ask  them  to  rank  on  the  ballot  all 
combinations  (like  Gaillard/unicameral,  Gaillard/bicameral,  Pinay/ 
unicameral,  Pinay/bicameral)  and  to  decide  simultaneously  who  is  to 
lead  and  which  type  of  parliament;  not  the  man  first  and  the  type 
second;  not  the  type  first  and  the  man  second. 

Everything  that  follows  in  this  chapter  pretends  that  variables  are 
measured  without  error.  Sections  10.2  and  10.3  introduce  two  new 
concepts:  discontinuous  hypotheses  and  the  null  hypothesis.  Sections 
10.4  to  10.8  explore  some  of  the!  commonest  hypotheses  considered  by 
econometricians,  especially  wheii  they  set  about  to  specify  a  model. 

10.2.  Discontinuous  hypotheses 

Consider  again  the  simple  model  Ct  =  a  +  yZt  4-  ut.  The  rival 
hypotheses  here  are  alternative  Values  of  ct  and  y  and  may  be  any  pair 
of  real  numbers.     This  is  an  example  of  continuity. 

Now  consider  this  problem:  ]|3oes  x  depend  on  1/,  or  the  other  way 
around?  Taking  the  dependence  (for  simplicity  only)  to  be  linear  and 
homogeneous,  the  rival  hypotheses  here  are 

xt  =  yyt  +  ut    versus    yt  —  Bxt  +  vt 

The  answer  is  yes  or  no;  either  the  first  or  the  second  equation  holds* 
This  is  an  example  of  discontinuity.     However,  tne  further  problem  of 
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the  size  of  y  (or  5),  by  itself,  may  be  a  continuous  hypothesis  problem. 
Many  of  my  examples  below  (Sec.  10.4)  are  discontinuous.    The 
simple  maximizing  rules  of  the  calculus  do  not  work  when  there  is 
discontinuity,  and  this  fact  makes  it  very  interesting. 

10.3.  The  null  hypothesis 

In  selecting  among  hypotheses  we  can  proceed  in  two  ways:  (1) 
compare  them  all  to  one  another;  (2)  compare  them  each  to  a  special, 
simple  one,  called  the  null  hypothesis  (symbolized  by  H0).  An  example 
of  the  first  procedure  is  the  maximum  likelihood  estimation  of  (#,7) 
in  the  model  Ct  =  a  +  yZt  +  ut,  since  it  compares  all  conceivable 
pairs  (a,y)  in  choosing  the  most  likely  among  them.  The  other  way  to 
proceed  is  somewhat  as  follows:  select  a  null  hypothesis,  for  example, 
a  =  3  and  y  =  0.7,  and  accept  or  reject  it  (i.e.,  accept  the  proposition 
11 either  a  j^  3  or  7  ^  0.7,  or  both")  from  evidence  in  the  sample.  I 
have  more  to  say  later  on  about  how  to  select  a  null  hypothesis  and 
what  criteria  to  use  for  accepting  or  rejecting  it.  Meanwhile,  note  that 
the  decision  to  proceed  via  null  hypothesis  has  nothing  to  do  with 
continuity  and  discontinuity,  though  it  happens  that  many  applica- 
tions of  the  null  hypothesis  technique  are  in  discontinuous  problems. 

10.4.  Examples  of  rival  hypotheses 

Many  of  the  examples  in  this  section  are  linear  and  homogeneous  for 
the  sake  of  simplicity  only ;  in  these  cases  linearity  (and  homogeneity)  is 
guaranteed  not  to  affect  the  principle  discussed.  In  other  examples, 
however,  linearity  (or  homogeneity)  is  a  rival  hypothesis  ana  thus 
very  much  involved  in  the  principle  discussed.     Now  to  the  examples: 

1.  Which  one  variable  from  a  given  set  of  explanatory  variables  is 
best?  For  instance,  should  we  put  income,  past  income,  past  con- 
sumption, or  age  in  a  rudimentary  consumption  function?  The  rival 
hypotheses  here  are 

Ct  =  0Y,  +  ut       Ct  =  yYt-i  +  ut       Ct'=  tCt-i  +  ut       etc. 

2.  Should  the  single  term  be  linear  or  quadratic,  logarithmic,  etc.? 
The  rival  hypotheses  here  are 
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&  =  0Yt  +  ut       Ct  =  yY*  f  ut       Ct  -  6  log  Yt  +  w«        etc. 

Note  that  this  becomes  a  special  case  of  example  1  if  we  agree  that 
72,  log  Y,  etc.,  are  different  variables  from  Y  (Sec.  10.9). 

3.  What  value  of  the  single  parameter  is  best?  In  Ct  =  &Yt  +  ut 
the  rival  hypotheses  are  different  values  of  /?,  say,  £  =  1,  /3  =  H, 
jg  =  %,  and  others.  This,  too,  is  a  special  case  of  example  1,  because 
it  can  be  expressed  as  a  choice  among  the  explanatory  variables  7,  27, 
47/3,  respectively. 

4.  Should  there  be  one  or  more  equations  in  the  model?  This  ques- 
tion, important  when  several  variables  are  involved,  lurks  behind  the 
problems  of  confluence  (see  Sec.  6.14),  but  it  arises  even  with  two 

variables.  I 

i 
| 
The    above    examples    generalize,    naturally.     For   instance,    the 

question  may  be  which  two  or  which  three  variables  to  include,  which 

linearly,  which  nonlinearly,  how  many  lags,  and  how  far  back. 

5.  Which  variables  are  to  be  regressed  on  which?  The  rival 
hypotheses  are 

Xi  =  ax2  +  u    versus    x2  «  0xi  4*  v 

for  two  variables.     If  we  maintain  the  hypothesis  of  three  variables  in 
a  single  equation,  the  rival  hypotheses  are 

Xi  =  aXi  -f-  0x3  +  u    versus    x2  ■  7x1  +  8x3  +  v 

versus    x3  «  %X\  +  f$§  4-  w 

And,  if  we  maintain  three  variables  and  two  equations,  i\w  rival 
hypotheses  become 

xi  =  ax%  +  pxs  4-  u  xi  =  €X2  4-  fx3  +  w 

1    »      i         versus  ,   n      .   , 

X2  «  t^i  4-  5x3  4-  v  X3  =  rjxi  4-  6x2  4- 1 

j  xa  «  kXi  4"  X#i  +  I 

versus  ,  1 

X3  =  /*Xi  4-  »*X|  4*  F 

and  so  on  for  more  equations  and  more  variables.    This  is  typically  a 
discontinuous  problem.     It  is  discussed  briefly  in  Sec.  10.8. 

6.  Having  decided  that  xi  is;  an  explanatory  variable,  does  it  help 
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to  include  #2  as  well?    The  rival  hypotheses  are 

y  «  axi  +  v    versus    y  =  fax  -f  yx*  +  w 

Section  10.8  contains  hints  on  this  problem. 

7.  Having  decided  to  include  xh  which  one  other  variable  should  be 
added? 

y  =  ax\  -f  fat  -+•  u    versus    y  =  yx\  +  6x3  +  v        etc. 

Section  10.8  applies  to  this  problem. 

8.  Is  it  better  to  have  a  ratio  model  or  an  additive  one? 

c         v 

-  s=*  a  -  +  u    versus    c  =  &y  +  yn  -f*  v 

This  is  discussed  in  Sec.  10.10. 

9.  Is  it  better  to  have  a  separate  equation  for  each  economic  sector 
or  the  same  equation  to  which  is  added  a  variable  characterizing  the 
sector?     For  example,  consider  the  following  rival  demand  models: 

q  «  ap  +  u    for  the  poor 

a     .         r     lL      .  1       versus    q  =  yp  +  ty  +  w 
g  «  0p  -h  t>     for  the  rich  v         ^        * 

where  y  is  income.     Section  10.11  discusses  this  problem. 

10.  (A  special  case  of  the  above.)  Are  dummy  variables  better  than 
separate  equations? 

q  =  ap  -f  u    in  wartime  q  =  yp  +  SQ  +  w 

versus 
q  =  /3p  +  v     in  peacetime  Q  =  0    in  peacetime 

Q  =  1     in  wartime 

This  problem  is  a  special  case  of  the  example  discussed  in  Sec.  10.11. 

11.  Do  variables  interact?  That  is  to  say,  does  the  size  of  one  or 
more  variables  fortify  (or  nullify)  the  others'  separate  effects?  For 
instance,  if  being  stupid  and  being  old  (the  variables  s  and  a,  respec- 
tively) are  bad  for  earning  income,  are  stupidity  and  old  age  in  com- 
bination worse  than  the  sum  of  their  separate  effects?  The  rival 
hypotheses  are 

y  =  as  4-  fa  +  w    versus    y  =  73  +  8a  -f-  csa  -f-  v 
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and  can  also  be  expressed  as  follows: 

y  =  75  +  da  +  «sa  4*  v        Null  hypothesis:  €  =  0 
or  as  follows: 

y  =  as  +  fi  +  w  for  the  young  ' .'  '  a  «  f 

•   *  i      r     iu     ri  Null  hypothesis:  _       . 

2/  =  73  -{-  5  +  v  for  the  old  J  "  0  =  1 

This  case  is  not  spelled  out,  but  the  discussion  of  Sec.  10.6  applies  to  it. 

This  list  is  not  exhaustive.     And,  naturally,  the  above  questions  can 
be  combined  into  complex  hypotheses. 

Digression  on  correlation  and  kindred  concepts 

This  is  a  good  place  to  gather  together  some  definitions  and 
theorems  and  to  issue  some  simple  but  often  unheeded  warnings. 
It  is  also  an  excellent  opportunity  to  learn,  by  doing  the  exercises, 
to  manipulate  correlations  and  regression  coefficients  as  wall  as 
all  sorts  of  moments. 

Universe  and  sample.  Keep  in  mind  that  Greek  letters  refer  to 
properties  of  the  universe  and  that  Latin  letters  are  used  to  refer 
to  the  corresponding  sample  properties. 

Thus,  as  already  explained  in  the  Digression  of  Sec.  1.1*2,  ir*»> 
aw  <rvv  are  population  variances  and  co variances  of  x  and  y.  The 
corresponding  sample  quantities1  are  mXX)  mxy,  myyt  the  so-called 
"moments  from  the  sample  means,"  introduced  in  the  game 
Digression, 

-  2  (*.  -  .*•)<¥.  -  y°) 


nij 


1  To  the  population  covarianccs  c  xy  there  correspond  two  types  of  sample 
quantities:  those  measured  from  the  r.iean  of  the  universe, 


qxV  -  Y  (x.  -  zx){y.  - 


where  8  runs  over  the  sample  S°;  and  those  measured  from  the  mean  of  the  sample, 
namely  mxy.  Interchanging  qxv  and  r,ixy  does  not  hurt  at  all,  in  general,  when  the 
underlying  model  is  linear,  since  mxy  is  an  unbiased,  consistent,  etc.,  estimator  of 
both  qxy  and  <rxy,  etc.  There  are  difficulties  in  the  case  of  nonlinear  models,  but 
we  shall  not  go  into  them  here. 
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where  s  runs  over  the  sample  5°.     The  universe  coefficient  of  cor- 
relation p  is  defined  by 

VzxVyv 

and  the  corresponding  sample  coefficient  r  by 

Later  on  we  define  partial,  multiple,  etc.,  coefficients  of  correla- 
tion. In  ail  cases,  a  coefficient  of  correlation  measures  the 
togetherness  of  two  and  only  two  variables,  though  one  or  both  may 
be  compounded  of  several  others.  This  elementary  fact  is  often 
forgotten. 

For  the  sake  of  symmetry  in  notation,  when  handling  several 
variables,  we  shall  use  x  with  subscripts:  xif  z2,  x3t  etc.  Then  we 
write  simply  pUj  m,  mn  for  p^x*,),  rfaxfch  w<Xl)(*i),  etc. 

Both  pxy  and  rxy  range  from  —  1  to  -f-1.  Values  very  near  ±  1 
mean  that  x  and  y  have  a  tight  linear  fit  like  ax  +  (3y  =  w,  with 
the  residuals  very  small.  A  tight  nonlinear  fit  like  x2  +  V2  =  1 
does  not  yield  a  large  coefficient  of  correlation  px„.  What  we  need 
to  describe  this  fit  is  p(*«xv»).  And  similarly  for  relations  like 
ay  +  |3  log  x  «  n  or  a?/2  +  &x3  «  w,  we  need  p(iog*Kv>>  Pc^'Xv1)* 
respectively. 

10.5.  Linear  confluence 

From  now  on  until  the  contrary  is  stated,  I  shall  deal  with  linear 
relations  exclusively.  The  discission  is  perfectly  general  for  any 
finite  number  of  variables,  but  three  are  enough  to  capture  the  essence 
of  the  problems  with  which  we  shall  be  dealing.  Let  the  three  variables 
be 

X\    number  of  pints  of  liquor  sold  at  a  ski  resort  in  a  day 
Xi    number  of  tourists  present  in  the  resort  area 
Xz    average  daily  temperature 

We  suppose  there  are  one  or  several  linear  stochastic  relationships 
among  some  or  all  of  these  variables.     The  least-squares-regression 
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coefficients  are  denoted  by  a's  and  6's,  with  a  standard  system  of 
subscripts. 

Begin  with  regressions  among  X\}  Xt,  X%}  taken  two  at  a  time; 
there  are  six  such  regressions.  In  (10-1)  below  these  regressions  are 
arranged  in  rows  according  to  the  variable  that  is  treated  as  if  it  were 
dependent  and  in  columns  according  to  the  variable  treated  m 
independent. 

X\  =  ai.2  +  612X2        Xi  =  Oi.3  +  bitXt 

Xi  =  a<i.\  4*  621X1        .  X2  =  a2.3  4-  623X3    (10-1) 

X3  =  a3.i  +  631X1        Xz  =  a32  +  632X2        

In  each  subscript,  the  very  first  digit  denotes  the  dependent  variable. 
If  there  is  a  second  digit  before  any  dot  appears,  it  denotes  the  inde- 
pendent variable  to  which  the  coefficient  belongs.  Digits  after  the 
dot  (if  any)  represent  the  other  independent  variables  (if  any)  present 
elsewhere  in  the  equation.  The  order  of  digits  before  the  dot  is 
material,  because  it  tells  which  variable  is  regressed  on  which.  The 
order  of  subscripts  after  the  dot  is  immaterial,  because  these  digits 
merely  record  the  other  "independent"  variables. 

The  same  three  variables  can  be  regressed  three  at  a  time.  There 
are  three  such  regressions: 

Xi   —   <Zi.23   +  612.3X2  +  613.2X3 

X2  =  a2.i3  +  621.3X1  +  623.1X3  (10-2) 

X3  =  a3.i2  4-  631.2X1  -f-  632.1X2 

As  an  exercise,  consider  the  four-variable  regression 

Xi  =  ai.234  4"  61:5.34X2  4-  613.24X3  4*  614.23X4 

and  fill  in  the  missing  subscripts  in 

X3  =  a__. 4"  6__.__Xi  4*  6 . X2  4-  6 . X4 

Returning  to  our  liquor  example,  suppose  we  decide  to  measure  the 
three  variables  not  from  zero  but  from  each  one's  sample  mean.  If 
primed  small  letters  represent  the  transformed  variables,  we  know 
that  the  a's  drop  out  and  the  6's  remain  unchanged.  This  is  so 
because  the  model  is  linear.  Our  relations  (10-1)  and  (10-2)  now 
become 

x[  =  612X2        •  •  '•        x'z  =  bn.al  4-  632.1*2 


144  SEARCHING  FOR  HYPOTHESES  AND  TESTING  THEM 

Exercises 

10.A  Prove  r<x,)<x,)  =  *Vx*t'>  m  ri2,  that  is  to  say,  that  correla- 
tion does  not  depend  on  the  origin  of  measurement. 

10.  B    Prove 

rlj  =  bifin 
Hint:  Use  moments. 

This  relation  says  that  the  coefficient  of  correlation  between  two 
variables  equals  the  geometric  mean  of  the  two  regression  slopes  we 
get  if  we  treat  each  in  turn  as  the  independent  variable.  The  less 
these  two  regressions  differ,  the  nearer  is  the  correlation  to  +1  or  —  1. 

10.6.  Partial  correlation 

Two  factors  may  account  for  x[,  the  sale  of  a  lot  of  liquor:  (1)  there 
are  many  people  (x\) ;  (2)  it  is  very  cold  (zj).    This  relation  is  expressed 

X[   m  blMXl  +  fris^a  (10-3) 

But  the  reason  that  (1)  there  are  many  people  in  the  resort  is  (a)  that 
the  weather  is  cold,  and  (possibly)  (&)  that  a  lot  of  drinking  is  going  on 
there,  making  it  fun  to  bo  there  apart  from  the  pleasure  of  skiing. 
This  is  expressed 

x'z  =  b2i.zx[  +  623- izj  (10-4) 

Suppose  we  wanted. to  know  whether  liquor  sales  would  be  correlated 
with  crowds  in  the  absence  of  weather  variations.     The  measure  we 
seek  is  the  partial  correlation  between  x[  and  x'2,  allowing  for  x'z.    This 
measure  is  symbolized  by  7*12.3.     It  is  interpreted  as  follows: 
Define  the  variables 

y[  =  x[  -  613.2^  (10-5) 

vi  -  4  -  &,m*J  "  (10-0) 

The  j/8  are  sales  corrected  for  weather  only  and  tourists  corrected  for 
weather  only.  If  we  have  corrected  both  for  weather,  any  remaining 
covariation  between  them  is  due  to  (1)  the  normal  desire  for  people  to 
drink  liquor  (the  more  tourists  the  more  liquor  is  sold),  (2)  the  possi- 
bility that  some  tourists  come  to  enjoy  drinking  rather  than  skiing  (the 
more  liquor,  the  more  tourists),  and  (3)  a  combination  of  the  first  two 
items. 
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The  partial  coefficient  of  correlation  is  defined  by 

Exercises 

10.C    Prove  r2i.3  =  ru.*.       j 

10.D    Prove  r\2.3  -  612.3&21.3.J    This  is  analogous  to  Exercise  10.B. 
Hint:  Substitute  (10-6)  and  (10-7)  into  (10-3)  and  (10-4). 
10.E    Prove 

—  7*12  —  ri3r23 

12,8       (1  r  Wl  -  A,)* 

from  definition  (10-7)  and  Exercises  10.C  and  10.D. 

10.  F  Give  a  common  sense!  interpretation  of  the  propositions  in 
the  above  three  exercises. 

10.G  All  this  generalizes  to  four  and  more  variables,  but  notation 
gets  wry  mossy.  Exorcise  10.D  generalizes  into  tho  proposition: 
Every  (partial  or  otherwise)  [coefficient  of  correlation  equals  tho 
geometric  mean  of  the  two  relevant  regression  coefficients.  So,  for 
example,  r2u.u  =  612.34621.34. 

Let  r  stand  for  the  matrix  of  all  simple  coefficients  of  correlation 
r»y,  and  let  Rij  stand  for  the  minor  of  r»y.  Then  Exercise  10.E  is 
rewritten 


R 11  Ri 


and  with  four  variables 


rim  -  ii-34  -  fii'H  -  rlva  -  g^gjj 

and  so  on  for  any  number  of  variables,  the  dimension  of  R  growing 
all  tho  while,  of  course 

10.11    Show  that  rj2>8  =>  R\%/R\\Rn  holds  but  collapses  into  an 
identity  when  there  is  no  third  variable. 

i 
10.7.  Standardized  variables 

Let  us  now  measure  Xi,  X2,  X3  not  only  as  departures  x{,  x'2,  x\  from 
their  sample  means  but  also  in  units  equal  to  the  sample  standard 


146  SEARCHING   FOR  HYPOTHESES   AND  TESTING  THEM 

deviation  of  each.    So  transformed,  the  variables  are  called  just 
$h  %h  **• 


This  step  is  useful  in  bunch  map  analysis  (see  Sec.  10.8).  When 
this  is  done,  nothing  happens  to  either  the  population  or  the  sample 
correlation  coefficients,  but  the  regression  parameters  between  the 
variables  do  change. 

Exercises 

10.1    Prove  mXlXt  =  (mXlXlmXlXl)->%<Xlo  <*,'). 

10.  J  Prove  that  r^'x*,')  «-  r (*,)<*,)  "■  rn  by  using  Exercises  10.  B 
and  10.C. 

10. K  Denote  the  regression  coefficients  among  x[9  x'2f  x's  by  the 
letter  b  and  the  corresponding  coefficients  among  the  standardized 
variables  xh  x2,  xz  by  the  letter  a,  with  appropriate  subscripts.  Inter- 
pret an.3,  021.3,  Aim;  show  that  they  differ  in  meaning  from  ai.23,  02-13, 
flan,  respectively. 

10.L  Show  that  an  »  6it(ffiu/ttiu)*S  and,  in  general,  that 
<ty.»  =  bij,k(mjj/mu)Kt 

10.  M    Show,  by  using  Exercise  10.L,  that  r*.*  ■  flatty.*. 

10.N  Show  that  rn  «■  an.  This  is  a  very  important  property, 
which  says  that  regression  and  correlation  coefficients  are  identical 
for  standardized  variables. 

10.O  Let  x"  =  (Xi  -  eX,)((rtl)-^.  Prove  p<x<")<*/'>  =  P(xinx,)f  and 
interpret. 


10.8.  Bunch  map  analysis 

Bunch  maps  are  mainly  of  archaeological  or  antiquarian  interest. 
They  seem  to  have  gone  out  of  fashion.  Beach  (pp.  172-175)  gives 
an  excellent  account  of  them  which  I  shall  not  repeat  here.  I  shall 
merely  discuss  necessary  and  sufficient  conditions  under  which  bunch 
maps  help  to  accept  or  reject  hypotheses. 

Turn  to  the  example  of  liquor  sales,  skiers,  and  cold  weather  in 
Sec.  10.5.  Let  xi,  rr2,  x$  be  the  three  standardized  variables.  Let  their 
correlation  coefficients  be 
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1     ria  ru 

if  -     r2i     1  rn 

T31      T32  1  . 
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1       0.5    0.2 

0.5    1       0.8 
0.2    0.8    1 


Compute  the  least  squares  regressions  of  all  normalized  variables,  two 

at  a  time: 

X\  =  aiiXi        Xi  =  O13X3 

xi  »  anX\        ......        x2  =  a2SX3  (10-8) 

X3  «■  O31X1        Xz  =  O32X2        


and  then  three  at  a  time: 


X\  =  Cti2.3^2  4"  013.2X3 
X2  =  021-3^1  +  O23.1X3 
Xz   =   031.2^1  +  032.1^2 


(10-9) 


Construct  now  the  unit  squares,  shown  in  Fig.  19,  where  0  marks  the 
origin.  In  each  block  the  horizontal  axis  corresponds  to  the  inde- 
pendent variable,  and  the  vertical  to  the  dependent.  The  labels 
below  the  squares  show  which  lis  which. 

Refer  now  to  the  first  equation  Xi  =  Oi2x2  in  (10-8).  From  Exercise 
10.N,  xi  =  7*12X2.  Imagine  a  unit  variation  in  *,he  independent  variable 
x2;  then  the  corresponding  variation  in  Xi,  according  to  this  equation, 
is  an.  Plot  the  point  (l,ai2)  in  the  first  block  of  squares.  Then  go 
to  the  symmetrical  equation  x2  ~  a2iXi,  make  Xi  vary  by  Axi  =  lt  and 
plot  the  resulting  point  (o2i,l)  in  the  same  block.  In  a  similar  way 
fill  out  the  top  row  of  Fig.  19,  drawing  the  beams  from  the  origin. 

In  (10-9),  first  consider  the  variation  in  xi  resulting  from  variations 
in  x2,  other  things  being  equal.:  We  get  three  different  answers  from 
(10-9),  one  per  equation:  | 


AXi   4   &12-3  AX2 

Axi  4  Ax2 

;    021-3 

A  ^32-1     A 

Axi  ~- Ax2 

031-2 


Digressing  a  little,  I  state  without  proof  that 


(10-10) 


Q»/.jfe 


R%1 
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A*2'i 


1    A*3 


1    A*3 


-1 


,  -First  place:  dependent  variable 
^-Second  place:  independent  variable 


After  dot:  variable  allowed  for 
Fig.  19.  Bunch  maps. 


(^22.^23) 


lRl2.-*13l 


1    Ax3 


Therefore  we  get  from  (10-10)  the  three  statements  that  A£i:Az2  is 
proportional  to  RuiR'n,  to  R22-R21,  and  to  —RniRw  In  the  figure 
this  is  depicted,  respectively,  by  the  beams  marked  (  12.3  ).  The  three 
regressions  in  general  conflict  both  with  regard  to  the  slope  and  with 
regard  to  the  length  of  the  beams. 

Derive  the  corresponding  relations  for  Axi :  Ax3  and  A#2 :  Az3.     These 
results  are  plotted  in  the  last  two  panels  of  Fig.  19. 
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i 

Exercise 


10. P    Plot  the  bunch  maps  for 


p  = 


1  i     -0.6     -0.1 
0.6        1  0.6 

0.1        0.6        1 


Scanning  Fig.  19  is  supposed;  to  tell  us  (1)  which  two  variables  to 
regress  if  we  want  to  stick  to  two  of  the  given  three,  and  (2)  whether  a 
third  variable  is  superfluous,  useful,  or  detrimental  in  some  very  loose 
sense. 

What  do  we  look  for  in  Fig.  19?  Three  things:  (1)  opening  or 
closing  of  bunch  maps  as  you  go  from  the  upper  to  the  lower  panel, 
(2)  shortening  of  the  beams,  and  (3)  change  of  direction  of  the  bunches. 
There  is  no  simple  intuitive  way  to  interpret  the  many  combinations 
of  1,  2,  and  3;  this  is  the  main  reason  why  statisticians  have  abandoned 
bunch  maps. 

The  examples  that  follow  far  from  exhaust  the  possibilities.  The 
moral  of  these  examples  is:  To  interpret  the  behavior  of  the  bunch 
maps,  you  must  translate  them  into  correlation  coefficients  r0  and 
try  to  interpret  what  it  means  for  the  coefficients  to  be  related  in  one 
way  or  another.  But  one  might  as  well  start  with  the  correlation 
coefficients,  bypassing  the  bunch  maps  altogether. 

Example  1.     The  vanishing  beam 

What  can  we  infer  if  beam  Ru/Rn  shrinks  in  length?  Take  the 
extreme  case  Ru  ~  0  and  Ru  «  0.  These  imply  rn  =  *Wn  and 
r|3  s:  l,  which,  in  turn,  imply  r%*  «*  ±1.  and  rn  m  ±P««  Let  us 
restrict  the  illustration  to  the  plus-sign  case  r23  =  1>  f\%  m  ?n* 

The  meaning  of  r2z  —  1  is  that  x2  and  xz  in  the  sample,  uneofrected 
for  variations  in  x\,  are  indistinguishable  variables.  Relation  r n  =  fis 
shows  that,  if  xi  and  x3  were  corrected  for  xh  the  corrections  Would  be 
identical;  the  resulting  corrected  variables  are  also  identical.  This 
can  also  be  seen  from  the  fact  that  in  these  circumstances  r a^i  ©quals  1. 
All  this  would,  of  course,  be  detectable  from  the  top  level  of  Fig.  19, 
signifying  that  three  variables  are  too  many  and  that  any  two  are 
nearly  as  good  as  any  other  two. 
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Example  2.    The  tilting  beam 

What  does  it  mean  if  beam  Ru/Rn  tilts  toward  one  axis  without 
shrinking  in  length?  For  instance,  let  Ru  7*  0  and  Ru  =  0.  This 
implies  again  that  r2;i  =  ±  1,  that  is  to  say,  xt  =  ±xs.  Taking  again 
just  the  +  case,  this  signifies  that  the  uncorrected  xt  and  x$  are  in 
perfect  agreement.  However,  Ru  —  ru  -  ru  ^  0  or  ru  9*  n3;  take 
the  case  ru  <  ru  for  the  sake  of  the  illustration.  The  inequality 
ria  /^  ru  suggests  that  the  corrections  of  x<i  and  x3  to  take  account  of 
variations  in  X\  will  be  different  corrections  and  will  upset  the  perfect 
harmony.     This  can  be  seen  again  from 

__       R23       __  7*23  —  ruru  __  1  —  rnris  .  - 

r2M      (B,JJ,0»      (1  -  rh)»(l  -  rlt)»      (1  -  r},)»(l  -  rj,)»  * 

In  terms  of  our  example,  there  is  a  spurious  perfect  correlation  between 
Xi,  the  number  of  skiers,  and  ar3,  the  weather.  It  is  spurious  because 
some  skiers  come  to  enjoy  not  the  weather  but  the  liquor.  However, 
liquor  sales  respond  less  perfectly  to  tourist  numbers  than  to  weather; 
that  is,  ri2  <  ri3.  Therefore,  if  you  take  into  account  the  fact  that 
liquor  too  attracts  skiers,  the  weather  is  not  so  perfectly  predictable  a 
magnet  for  skiers  as  you  might  have  thought  by  looking  at  r23  =  1. 
The  hypothesis  accepted  in  this  case  is:  Liquor  is  significant  and  ought 
to  be  introduced  in  a  predictive  model. 

Exercises 

10. Q  Show  that,  if  beam  ai2.3  has  the  same  slope  as  an,  this  implies 
O12.3  =  rn  and  also  ri3  =  r23  and,  hence,  that  all  three  beams  of  the 
bunch  map  come  together.     Interpret  this. 

I0.R  Interpret  the  situation  where  all  three  beams  Ru/Rn,  Rn/Rn, 
and  —Rzi/Rn  have  the  same  slope.  Must  they  necessarily  have  the 
same  length?    Must  the  common  slope  necessarily  equal  ai2? 

10.9.  Testing  for  linearity 

If  the  rival  hypotheses  are 

y  =  j3x  -f-  u    versus    y  =  yx*  +  v 

the  matter  is  quickly  settled  by  comparing  the  correlation  coefficients 
fxv  with  rxt.v.    Things  become  complicated  if  the  quadratic  function 
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contains  a  linear  term,  because  the  function  y  «■  yx*  +  fa  <+•  v  con- 
tains the  linear  function  j  =  fix  +  u  as  a  special  case;  therefore,  we 
would  expect  the  correlation  to  be  improved  by  adding  a  higher  term. 
Thus,  for  any  fit  giving  estimates  #,  7,  and  S,  rv.^xt^x)  is  bound  to  be 
greater  than  rv.tfx) .  Correlation  coefficients  do  not  give  the  best  tests  of 
linearity.  Common  sense  suggests  something  simpler  and  more 
intuitive. 
The  curves  in  Fig.  20a  represent  the  two  rival  hypotheses.     If  the 
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Fig.  20.  Tests  of  nonlinearity. 


quadratic  is  true  but  we  fit  a  straight  line,  then  the  computed  residuals 
from  the  fitted  straight  line  will  be  overwhelmingly  positive  for 
some  ranges  of  x  and  overwhelmingly  negative  for  other  ranges.  These 
ranges  are  defined  in  terms  of  the  intersections  of  the  rival  curves. 
Somewhere  left  of  A  most  residuals  arc  negative,  and  to  the  right,  most 
are  positive.  Complicated  nu  merical  formulas  for  testing  nonlinearity 
are  nothing  but  algebraic  translations  of  this  simple  test. 

All  this  generalizes  quite  rea  dily .  For  instance,  the  test  of  hypothesis 
y  =  ax  -f  u  versus  a  cubic  is  sketched  in  Fig.  206;  a  quadratic  versus 
a  cubic  in  Fig.  20c.     And  it  generalizes  into  several  variables  x,  y,  z,  etc. 
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la  each  case  the  test  consists  in  dividing  the  range  of  x  into  several 
equal  parts  Pi,  P2,  .  .  .  ,  as  shown  in  either  Fig.  21a  or  216.  In  each 
part  compute  the  average  straight-line  regression  residual  av  u.  If  this 
tends  to  vary  systematically  (with  a  trend  or  in  waves),  the  relationship 
is  nonlinear. 

When  we  have  three  or  more  variables  x>  y,  z  and  want  to  test 
linearity  versus  some  other  hypothesis,  we  have  to  extend  to  two 
dimensions  the  technique  of  Fig.  21.     Let  the  rival  hypotheses  be 

x  =  a  +  (3y  +  yz  4-  u    versus    x  =  5  +  cy  -f-  ft/2  +  yz  +  Oz2  +  *yz  +  v 

In  the  yz  plane  the  intersection  of  these  two  surfaces  projects  a 
hard-to-solve-for  and  messy  curve  KLMNP  (see  Fig.  22a).  Instead  of 
obtaining  it,  let  us  see  whether  we  can  sketch  it  vaguely.  Divide  the 
sample  range  of  y  arid  z  into  chunks,  as  shown  in  the  figure  (they  do  not 
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Fig.  21.  The  interval  test. 

need  to  be  square,  and  they  may  overlap  in  a  systematic  way  analogous 
to  Fig.  216).  In  each  chunk,  compute  the  average  linear  residual  av  u, 
and  see  whether  a  pattern  emerges.  By  drawing  approximate  contour 
lines  according  to  the  local  elevation  of  av  u,  we  may  be  able  to  detect 
mountains  or  valleys,  which  tell  us  that  the  true  relationship  is  non- 
linear. Something  analogous  can  be  done  when  both  rival  hypotheses 
are  nonlinear. 

10.10.  Linear  versus  ratio  models 

The  rival  hypotheses  here  are 

c  XI 

-  =  a-f-|3--f-w     versus    c  =  y  +  5y  +  en  +  v 

n  n 

where  u  and  v  have  the  usual  properties  to  ensure  that  least  squares 
fits  are  valid. 
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If  the  ratio  model  is  the  maintained  hypothesis,  then  we  would 
expect  av  u  to  be  constant  over  successive  segments  of  the  axis  y/n. 
Translated  into  the  projection  on  the  yn  plane,  this  means  that  av  w 
should  be  constant  in  the  successive  slices  shown  in  Fig.  226.  For  the 
linear  model,  av  v  should  be  constant  in  the  squares  of  Fig.  22c.  In 
general,  one  criterion  will  be  satisfied  better  than  the  other  and  will 
plead  for  the  rejection  of  the  opposite  hypothesis.     If  both  criteria  are 
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substantially  satisfied,  then  there  is  no  problem  of  choosing,  because 
both  formulations  say  that  cl  y,  and  n  are  related  linearly  and  homo- 
geneously (7  =s  0).  One  formulation  might  possibly  be  more  efficient 
than  the  other  for  reasons  of  "skedasticity"  (compare  Sec.  2.15). 


10.11,  Split  sectors  versus  sector  variable 

The  rival  hypotheses  here  are  whether  the  demand  for,  say,  sugar 
should  be  estimated  for  all  consumers  as  a  linear  function  of  price  and 
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income  q  =*  yp  +  6y  +  w  (where  the  price  paid  is  uncorrelated  with 
income)  or  should  be  split  into  several  demand  functions  q  =  ap  +  w, 
q  —  Pp  +  v,  etc.,  one  for  each  income  class,  on  the  ground  that  price 
means  more  to  the  poor  than  to  the  rich. 

For  illustration  it  is  enough  if  we  have  just  two  income  classes,  the 
rich  and  the  poor,  corresponding  to,  say,  y  «=  10,  y  «  1.  Nothing 
essential  would  be  added  if  y  were  taken  as  a  continuous  variable. 

As  in  Fig.  22c,  construct  a  grid  for  the  sample  range  of  variables  y 
and  p.  If  av  w  is  constant,  the  single  equation  q  —  yp  -\-  8y  -\-  w 
is  good  enough,  and,  moreover,  we  have  a  =  0  in  the  alternative 
hypothesis.  If,,  however,  the  second  hypothesis  is  correct,  not  only 
will  a  be  very  different  from  /§,  but  av  w  will  display  contours  like 
those  of  Fig.  22d. 

10.12.  How  hypotheses  are  chosen 

In  this  section  I  am  neither  critical,  nor  constructive,  nor  original. 
I  think  it  proper  to  look  at  the  way  that  statistical  hypothesis  mak- 
ing and  testing  takes  place  around  us. 

The  econometrician,  geneticist,  or  other  investigator  usually  begins 
with  (1)  prejudices  instilled  from  previous  study,  (2)  vague  impressions, 
(3)  data,  (4)  some  vague  hypotheses. 

He  then  casts  a  preliminary  look  at  the  data  and  informally  rejects 
some  because  they  represent  special  cases  (war  years,  for  instance, 
or  extremely  wealthy  people)  and  others  because  they  do  not  square 
with  the  vague  hypotheses  he  holds.  He  uses  the  remaining  data 
informally  to  throw  out  some  of  his  hypotheses,  from  among  those  that 
are  relatively  vague  and  not  too  firmly  grounded  in  prejudice. 

At  this  stage  he  may  prefer  to  scan  the  data  mechanically,  say,  by 
bunch  maps,  rather  than  impressionistically.  Mechanical  prescreen- 
ing  is  used  (1)  because  the  variables  are  many,  and  the  unaided  eye  is 
bewildered  by  them,  and  (2)  because  the  research  worker  is  chicken- 
hearted  and  distrusts  his  judgment.  Logically,  of  course,  any 
mechanical  method  is  an  implicit  blend  of  theory  and  estimating 
criteria;  but,  psychologically,  it  has  the  appearance  of  objectivity. 
The  good  researcher  knows  this,  but  he  too  is  overwhelmed  by  the 
illusion  that  mechanisms  are  objective. 

Having  done  all  this,  the  investigator  at  long  last  comes  to  specifica- 
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tion  (as  described  in  Chap.  1) ;  he  then  estimates,  accepts,  rejects,  or 
samples  again. 

This  stage-by-stage  procedure  is  logically  wrong,  but  economically 
efficient,  psychologically  appealing,  and  practically  harmless  in  the 
hands  of  a  skilled  researcher  with  a  feel  for  his  area  of  study. 

Instead  of  proceeding  stage  by  stage,  is  there  a  way  to  let  the  facts 
speak  for  themselves  in  one  grand  test?  The  answer  is  no.  We  must 
start  with  some  hypothesis  or  we  do  not  even  have  facts.  True, 
hypotheses  may  be  more  or  less  restrictive.  But  the  less  restrictive  the 
hypotheses  are,  the  less  a  given  body  of  data  can  tell  us. 

Further  readings 

For  rigorous  treatment  of  the  theory  of  hypothesis  testing,  one  needs  to 
know  set  theory  and  topology.  Klein's  discussion,  pp.  56-62,  gives  a  good 
first  glimpse  of  this  approach  and  a  good  bibliography,  p.  63. 

For  treatment  of  errors  in  the  variables,  consult  Trygve  Haavelmo,  "Some 
Remarks  on  Frisch's  Confluence  Analysis  and  Its  Use  in  Econometrics," 
chap.  V  in  Koopmans,  pp.  258-2C5. 

Beach  discusses  bunch  maps  and  the  question  of  superfluous,  useful,  or 
detrimental  variables,  pp.  174-175.  Tinbergen,  pp.  80-83,  shows  a  five- 
variable  example. 

Cyril  H.  Gouldcn,  Methods  of  Statistical  Analysis,  2d  ed.,  chap.  7  (New 
York:  John  Wiley  &  Sons,  Inc.,  1952),  gives  an  elementary  discussion  of 
p  and  the  sample  properties  of  its*  estimate  r. 


CHAPTER  11 


Unspecified  factors 


11.1.  Reasons  for  unspecified  factor  analysis 

Having  specified  his  explanatory  variables,  the  model  builder  fre- 
quently knows  (or  suspects)  that  there  are  other  variables  at  work  that 
are  hard  to  incorporate. 

1.  The  additional  variable  (or  variables)  may  be  unknown,  like  the 
planet  Neptune,  which  used  to  upset  other  orbits. 

2.  The  additional  variable  may  be  known  but  hard  to  measure.  For 
instance,  technological  change  affects  the  production  function,  but 
how  are  we  to  introduce  it  explicitly? 

There  are  two  ways  out  of  this  difficulty:  splitting  the  sample,  and 
dummy  variables.  When  we  split  the  sample  we  fit  the  production 
function  to  each  fragment  independently  in  the  hope  that  each  frag- 
ment is  uniform  enough  with  regard  to  the  state  of  technology  and  yet 
large  enough  to  contain  sufficient  degrees  of  freedom  to  estimate  the 
parameters.  The  technique  of  dummy  variables  does  not  split  the 
sample,  but  instead  introduces  a  variable  that  takes  on  two  and  only 
two  values  or  levels:  0  when,  say,  there  is  peace,  and  1  when  there  is  war. 
Phenomena  that  are  capable  of  taking  on  three  or  more  distinct  states 
are  not  suited  to  the  dummy  variable  technique.     For  instance,  it 
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would  not  do  to  count  0  for  peace,  0.67  for  cold  war,  and  1  for  shooting 
war,  because  this  would  impose  an  artificial  metric  scale  on  the  state 
of  world  politics  which  would  affect  the  parameters  attached  to  honest- 
to-goodness,  truly  measurable  variables.  No  artificial  metric  scale  is 
introduced  by  the  two-level  dummy  variable. 

3.  The  additional  factors  at  work  may  be  a  composite  of  many 
factors,  too  many  to  include  separately  and  yet  not  numerous  enough 
or  independent  enough  of  one  another  to  relegate  to  the  random  term 
of  the  equation.  j 

4.  The  additional  variable  may  be  known  and  measurable,  but 
we  may  not  know  whether ,  to  include  it  linearly,  quadratically,  or 
otherwise.  j 

5.  The  additional  variable  may  be  known,  measurable,  etc.,  but 
not  simple  to  put  in.  To  admit  a  wavy  trend  line,  for  instance,  eats 
up  several  degrees  of  freedom. 

I 
In  such  cases  the  unspecified  variable  technique  comes  to  our  rescue, 
at  a  price,  because  it  sometimes  requires  special  knowledge.  In  the 
illustration  of  Sec.  11.2,  for  instance,  to  estimate  a  production  function 
that  shifts  with  technological  change,  time  series  are  not  enough.  The 
data  must  contain  information  about  inputs  and  outputs  broken  down, 
say,  by  region,  or  in  some  dimension  besides  chronology. 

i 
11.2.  A  single  unspecified  variable 

This  section  is  based  on  the  technique  developed  by  C.  E.  V.  Leser1 
in  his  study  of  British  coal  mining  during  1943-1953,  years  of  rapid 
technological  change,  nationalization,  and  other  disturbances. 

He  fitted  the  function  Prt  ~  QtL"fini  where  P  is  product,  L  is  labor, 
C  is  capital,  gt  is  the  unspecified  impact  of  technology,  r  and  t  are 
regional  and  time  indices,  and  a,  0  are  the  unknown  parameters. 

Here  for  exposition's  sake,  I  shall  linearize  his  model  and  drop  the 
second  specified  variable.    Consider  then 

Prt  «  gt  +  aLrt  +  Urt  (11-1) 

1  C.  E.  V.  Leser,  "Production  Functions  and  British  Coal  Mining"  (Rfflnometrica, 
vol.  23,  no.  4,  pp.  442-446,  October,  1955). 
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The  following  assumptions  are  made: 

1.  Technology  affects  all  regions  equally  in  any  moment  of  time. 

2.  The  same  production  function  applies  to  all  regions. 

3.  The  random  term  is  normal,  with  a  period  moan 
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equal  to  zero,  and  a  regional  mean 
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also  equal  to  zero.  We  shall  now  use  the  notation  av[r]itrt  and  av[^]ur< 
for  expressions  like  the  last  two. l 

Now,  keeping  time  fixed  at  it  «■  1,  let  us  average  inputs  and 
outputs  over  the  R  regions.  From  (11-1)  we  get,  remembering  that 
av[r](7f,  -  gh} 

av[r]Pr<t  -  gtl  +  a  av[r]Lrt,  (11-2) 

And,  by  subtracting  (11-2)  from  (11-1),  we  get  the  following  relation 
between  P'ril  and  Urtl,  which  are  product  and  labor  measured  from  their 
mean  values  of  period  1 : 

P'rh  -  «I4  +  urh  (11-3) 

Do  the  same  for  t  =  2,  ,  .  .  ,  T  and  then  maximize  the  likelihood  of 
the  sample.  Under  the  usual  assumptions,  this  is  equivalent  to 
minimizing  the  sum  of  squares 

rt 

The  resulting  estimate  of  a  is 

&  =  ^  (11-4) 

ThL'L' 

In  this  expression  the  moments  are  sums  running  over  all  regions  and 
time  periods. 

1  Read  "average  over  the  r  regions,"  "average  over  the  t  years." 
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Having  found  a,  we  can  go  pack  to  (11-2)  to  compute  the  time  path 

gt  of  the  unspecified  variable,  technology. 
The  method  I  have  just  outlined  has  several  advantages: 

1.  It  uses  R  XT  observations  (a  largo  number  of  degrees  of  freedom) 
in  estimating  the  parameter  al. 

2.  Unlike  split  sampling,  it  i  obtains  a  single  parameter  estimate  for 
all  regions  and  periods. 

3.  It  yields  us  an  estimate  of  the  unspecified  variable  (technological 
change),  if  it  is  the  only  other  factor  at  work. 

4.  This  technological  change  does  not  have  to  be  a  simple  function 
of  time.  It  may  be  secular,  cyclical,  or  erratic;  it  can  be  linear, 
quadratic,  or  anything  else.    ! 

5.  The  method  estimates,  in  addition  to  technology,  the  effects  of 
any  number  of  other  unspecified  variables  (such  as  inflation,  war, 

nationalization)  which  at  any  moment  may  affect  all  regions  equally. 

i 

The  chief  disadvantage  of!  the  technique  is  that  the  unspecified 
variable  gt  has  to  be  introduced  in  a  manner  congenial  to  the  model, 
that  is  to  say,  as  a  linear  term  in  a  linear  model,  as  a  factor  in  Leser's 
logarithmic  model,  and  so  forth  j  otherwise  it  would  not  drop  out, 
as  in  (11-3)  when  we  express  the  specified  variables  as  departures 
from  their  average  values. 

For  the  unspecified  variable  technique  to  be  successful  it  is  necessary 
that  the  data  come  classified  in  one  more  dimension  than  there  are 
unspecified  variables.  Thus  P  and  L  must  have  two  subscripts. 
Moreover,  each  region  must  have  coal  mines  in  each  time  period.1 


11.3.  Several  unspecified!  variables 

Imagine  now  that  we  wish  to  explain  retail  price  P  in  terms  of  unit 
cost  C,  distance  or  location  D,  monopoly  M,  and  the  general  level  of 
inflation  J.  Cost  is  the  specified  variable,  and  location,  monopoly, 
and  inflation  are  left  unspecified  for  one  or  another  of  the  reasons  I 

1  There  are  methods  for  treating  lacunes,  or  missing  data,  but  these  are  rather 
elaborate  and  will  not  be  discussed  in  this  work.  The  usual  way  to  treat  a  lacune 
is  by  pretending  it  is  full  of  data  that  interpolate  perfectly  in  whatever  structural 
relationship  is  finally  assigned  to  the  original  data. 
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recounted  in  Sec.  11.1.    The  model,  assumed  to  be  linear,  is 

Pfirt  =  Mi  +  Dr+Jt  +  aCfM  +  Ufirt  (11-5) 

where  the  subscripts  /,  t,  r,  t  express  firm,  industry,  region,  and  time. 
The  model,  as  written,  maintains  that  the  degree  of  monopoly  is  a 
property  of  the  industry  only,  not  of  the  region  or  of  inflationary 
situation  or  of  interactions  among  the  three.  Similarly,  inflation  is 
solely  a  function  of  the  time  and  not  of  the  degree  of  monopoly  and 
location  of  industry.  Note  again  that  the  data  have  to  come  classified 
in  one  more  dimension  than  there  are  unspecified  variables.  Thus  P 
and  C  must  have  four  subscripts,  one  for  each  of  the  unspecified 
variables,  plus  an  extra  one  (firm  /).  Moreover,  unless  we  have 
lacunes,  each  firm  must  be  present  in  each  industry,  region,  and  time 
period.  The  firms  of  Montgomery  Ward  and  Sears  Roebuck  would 
do,1  and  the  industries  they  enter  can  be,  say,  watch  retailing,  tire 
retailing,  clothing  retailing,  etc. 

In  that  case,  a  is  estimated  analogously  to  (11-4)  by  &  =  rnp'c'/mC'c'f 
where  the  moments  are  sums  running  over  /,  t,  r,  t.  Having  esti- 
mated a,  we  can  now  define  a  new  variable  S,  the  price-cost  spread 
S  =  jP  -  dC.     The  model  is  now 

Sfrt  =  Mi  +  Dr  +  Jt  +  vfirt  (11-6) 

Estimating  M ,  Z),  and  J  is  the  so-called  problem  of  linear  factor  analysis. 

11.4,  Linear  orthogonal  factor  analysis 

Linear  factor  analysis  attempts  to  explain  the  spread  S  as  an  additive 
resultant  of  two  or  more  separate  factors;  in  the  example  of  (11-6) 
there  are  three  factors:  monopoly,  region,  and  inflation. 

Nothing  essential  is  lost  if  we  confine  ourselves  to  two  factors,  say, 
monopoly  and  inflation,  and  consider  the  simpler  model 

Sf*  -  Mi  +  Jt +  p,u  (H-7) 

To  grasp  its  essence,  imagine  that  there  are  no  random  disturbances 
(y  =  0)  and  that  there  is  only  one  firm,  which  sells  three  products 

1  Provided  both  exist  in  all  time  periods,  regions,  and  industries  included  in 
the  sample. 
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(tires,  watches,  clothes)  over  5  years.  Observations  can  be  put  in  a 
3-by-5  table  or  matrix  whose  rows  correspond  to  the  commodities  and 
columns  to  the  years: 


S  = 


Factor  analysis  seeks  to  express  this  table  as  the  sum  of  two  tables 
M  and  J  of  similar  dimensions,  the  first  with  constant  rows  and  the 
second  with  constant  columns*. 


st, 

$12 

$13 

Su 

$15 

$21 

$22 

$23 

$24 

$25 

$31 

$32 

$33 

$34 

$35 

M  = 


Mi  Mx  Mr  Mr  j  Mr 
Mi  Mi  M2  Mt  \  M2 
Mi    Mz    Mz    M3  i  Mz 


J  = 


J\      Ji      J*      J\      */§ 

Jl      Ji      Jz      Ji      JjJ 
J\      Jz      Jz      Ji      J& 


In  a  practical  problem  this  cannot  be  done  exactly,  particularly  if 
several  firms  are  involved.  This  is  the  familiar  problem  of  conflicting 
observations,  which  is  treated  in  Sec.  7.2.  In  practice,  some  com- 
promise is  found  which  gives  the  M  and  J  that  "fit  best"  the  observa- 
tions S. 

A  graphic  way  to  express  the  problem  of  factor  analysis  is  the 
following.  You  are  given  a  rectangular  piece,  say,  3  by  5  miles,  of  a 
topographical  map  with  contour  lines  showing  the  elevation  at  various 
spots.  You  are  supposed  to  find  a  landscape  profile  running  from 
north  to  south  and  another  one  running  from  east  to  west  with  the 
property  that,  if  you  slide  the  bottom  of  the  first  perpendicularly  along 
the  humps  and  bumps  of  the  second,  the  top  crests  describe  the  original 
surface  of  the  3-by-5  map.  The  same  happens  if  you  interchange  the 
roles  of  the  two  profiles.  The  two  profiles  are  kept  always  perpen- 
dicular to  each  other;  and  this  is  why  the  literature  calls  the  two  fac- 
tors M  and  J  orthogonal  (that  is  to  say,  right-angled) .     (See  Fig.  23.) 

Computing  differences  among  the  various  entries  in  M  and  J  is  a 
simple  matter  under  the  usual  assumptions.  Again,  we  minimize  the 
expression 


2' 

fit 


with  respect  to  Mh  Mif  Mz,  Ji,  J  2,  Ji,  Ji,  Ji-    Thus  the  solution  for 
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Mi  is 


and  that  of  J%  is 
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{J*»-'2*} 


FT 


^3  =  -^—  -i~ 


F/ 


(11-8) 


(H-9) 


where  F,  /,  jT  are  the  total  number  of  firms,  industries,  and  time 
periods,  respectively.  Note  that,  to  estimate  the  degree  of  monopoly 
in  the  first  industry,  we  need  knowledge  of  inflation  in  all  years;  to 
estimate  inflation  in  year  3,  we  need  measures  *  "  monopoly  for  all 


Elevation 


Elevation 


y*K        v/\y-"~~ 


North     3  miles     South  West 

Fig.  23 


5  miles 


East 


industries.  Equation  (11-8)  can  be  rationalized  as  follows:  to  estimate 
the  effect  of  monopoly  in  the  first  industry,  disregard  the  price-cost 
spread  in  all  other  industries,  and  compute  the  over-all  (firm-to-firm 
and  period-to-period)  average  spread  in  industry  1 : 


£S/u 


FT 


From  this  deduct  the  average  inflationary  impact 

2* 


What  is  left  is  the  monopoly  impact. 
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11.5.  Testing  orthogonality 

It  is  entirely  possible  for  inflation's  impact  on  the  price-cost  spread 
to  be  related  to  monopoly.  Indeed,  there  is  evidence  from  th©  Second 
World  War  that  price  control  was  more  successful  in  monopolistic 
industries  (and  firms)  than  in  competitive  ones.  A  monopolist  or 
monopolistic  competitor  is  recognized  and  remembered  by  the  public. 
If  he  takes  advantage  of  inflation,  he  may  lose  goodwill  or  perhaps  be 
sued  by  the  government  as  an  example  to  others.  If  monopoly  and 
inflation  interact  in  this  way  or  in  some  other  way,  the  linear  model 
(11-7)  is  not  applicable.  Because  it  is  simple,  however,  we  may  adopt 
it  as  our  null  hypothesis,  fit  b,  and  look  for  a  systematic  pattern  dis- 
crepancy as  a  test  of  the  hypothesis. 

The  formulas  for  doing  this  are  rather  complicated  expressions, 
which  I  shall  not  bother  to  state.  Intuitively  the  test  is  quite  simple. 
If  by  rearranging  whole  rows  f.nd  whole  columns,  table  S  can  be  made 
to  have  its  highest  entry  in  t,ho  upper  left-hand  comer,  its  smallest 
entry  in  the  lower  right-hand  corner,  with  each  row  and  column 
stepping  down  by  equal  amounts,  the  null  hypothesis  holds.  For 
example, 


s  = 

[S 

15 
17 

5] 

can  be  rearranged  thus: 

S'  = 

■[S 

14 
12 

I?] 

Note  that 

S'  = 

[-15 

15 

12 
12 

Sl+R 

o\  <n-10> 

To  state  the  same  test  in  terms  of  our  geographic  profiles  of  Sec.  11.4: 
Cut  up  the  original  map  into  north-south  strips,  rearrange,  and  then 
glue  them  together.  Then  cut  the  resulting  map  into  east-west  strips 
and  rearrange  these.  Should  this  procedure  produce  a  map  of  a  terri- 
tory (1)  sloping  from  its  northwest  corner  down  to  its  southeast  corner, 
(2)  with  neither  local  hills  nor  saddle  points,  and  (3)  such  that,  if  you 
stand  anywhere  on  a  given  geographical  parallel  and  take  one  step 
south,  you  step  down  by  an  equal  amount,  say,  3  feet,  and  (4)  such 
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that,  likewise,  if  you  start  from  any  point  on  a  fixed  meridian,  one 
eastward  step  loses  the  same  elevation,  say,  2.1  feet,  then  the  factors 
are  orthogonal. 

In  arithmetical  terms,  having  estimated  Afi,  M2,  .  .  .  ,  Js,  rearrange 
the  rows  and  columns  so  that  the  most  monopolistic  industry  occupies 
the  top  row  and  the  most  inflationary  year  occupies  the  leftmost 
column.    Compute  the  residuals 

tfu  =  Sfu  -  ifti  -  Ji 
and  place  their  sums 

/ 

in  the  appropriate  row  and  column.  Any  run,  or  large  local  concentra- 
tion, of  mostly  positive  or  mostly  negative  residuals  is  evidence  that 
monopoly  and  inflation  have  interaction  effects  (are  not  orthogonal 
factors). 

11.6.  Factor  analysis  and  variance  analysis 

Unspecified  factor  analysis,  the  technique  explained  in  this  chap- 
ter, should  be  carefully  distinguished  from  variance  analysis  (and 
from -factor  analysis  in  the  principal  components  sense  of  the  term). 
Both  techniques  make  use  of  a  row-column  classification,  and  both 
usually  proceed  on  the  null  hypothesis  that  rows  and  columns  do 
not  interact.  But  here  the  similarities  end.  Factor  analysis  meas- 
ures the  row  and  column  effects  for  each  row  and  column,  i.e.,  it 
computes  the  unspecified  variable.  Variance  analysis  attributes 
various  percentages  of  total  variance1  to  differences  among  all  rows,  to 
differences  among  all  columns,  and  the  remainder  to  chance.  Factor 
analysis endswith / -f-  T estimates $1 u&t,  .  .  .  ,lSti\3i,J%9  .  .  .  ,«/r. 
Variance  analysis  ends  with  three  percentages  expressing  row  variance, 
column  variance,  and  unexplained  variance  in  terms  of  total  variance. 

1  Total  variance  in  terms  of  the  example,  model  (11*7),  is 

Y0S/«  -avS)» 

fit 

FIT 
where  av  S  is  av[/i7]<S/,r,  or  the  average  spread  over  the  entire  sample. 
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In  the  course  of  analysis  of  variance,  row  means  (14%  and  12%)  and 
column  means  (16,  13,  and  12)  are  computed,  but  they  are  only  aux- 
iliary quantities,  not  estimates  of  factor  impacts.  However,  the 
differences  in  these  two  sets  of  means  are  equal  respectively  to  the 
differences  in  the  impact  [(2  and  0)  and  (15,  12,  and  11)]  of  the  two 
variables  into  which  S'  is  factorable  [see  equation  (11-10)]. 

It  is  not  my  intention  to  go  into  the  details  of  variance  analysis. 
Just  three  comments  about  it: 

1.  The  reason  why  people  analyze  variance  and  not  the  fourth  or 
seventeenth  moment  of  the  sample  is  this:  A  normal  distribution  with 
zero  mean  (such  as  the  error  term  V/u)  can  be  completely  described  by 
its  variance.  The  variance  is  a  sufficient  estimate,  for  it  contains  all 
the  information  that  is  implicit  in  the  assumed  distribution. 

2.  Under  orthogonality,  row,  column,  and  unexplained  variances 
add  up  to  total  variance,  just  as  the  square  on  the  hypotenuse  equals 
the  sum  of  the  squares  on  the  other  sides  of  a  right-angled  (orthogonal) 
triangle. 

3.  Under  normality  and  orthogonality,  variance  ratios  have  certain 
convenient  distributions,  which  are  suitable  for  testing  the  null 
hypothesis  (that  rows  or  columns  differ  only  by  chance). 

Further  readings 

Harold  W.  Watts,  "Long-run  Income  Expectations  and  Consumer  Saving," 
in  Studies  in  Household  Economic  Behavior,  by  Dernburg,  Rosett,  and  Watts 
(Yale  Studies  in  Economics,  vol.  9,  pp.  103-144,  New  Haven,  Conn.,  1958), 
makes  judicious  use  of  dummy  variables. 

Robert  M.  Solow,  "Technical  Change  and  the  Aggregate  Production 
Functions"  (Review  of  Economics  and  Statistics,  vol.  39,  no.  3,  pp.  312-320, 
August,  1957),  computes  the  unspecified  variable  "technology"  not,  as  we 
have  done  in  Sec.  11.2,  by  interregional  aggregation,  but  by  using  the  marginal 
productivity  theory  of  distribution. 

Variance  analysis  is  a  vast  subject.    See  Kendall,  vol.  2,  chaps.  23  and  24. 


CHAPTER  12 

Time  series 


12.1.  Introduction 

A  time  series  x(t)  —  [x(l),  .  .  .  ,  x(T)]  is  a  collection  of  readings, 
belonging  to  different  time  periods,  of  some  price,  quantity,  or  other 
economic  variable.  We  shall  confine  ourselves  to  discrete,  consecutive, 
and  equidistant  time  points. 

Like  all  the  kinds  of  manifestations  with  which  econometrics  deals, 
economic  time  series,  both  singly  and  in  combination,  are  generated 
by  the  systematic  and  stochastic  logic  of  the  economy.  The  same 
techniques  of  estimation,  hypothesis  searching,  hypothesis  testing,  and 
forecasting  that  work  elsewhere  in  econometrics  work  also  in  time 
series. 

Why  then  a  chapter  on  time  series?  Why  indeed,  were  it  not  for  the 
large  amount  of  muddle  and  confusion  we  have  inherited  from  many 
decades  of  well-intentioned  but  faulty  investigations. 

The  earliest  and  most  abused  time  scries  are  charts  of  the  business 
cycle  and  security  market  behavior.  Desiring  knowledge,  business 
cycle  "physiologists"  avoided  all  models,  assumptions,  and  hypotheses 
in  the  hope  that  the  facts  would  speak  for  themselves.  Pursuing 
profit,  stock  market  forecasters  have  sought  and  are  seeking  (and  their 
clients  are  buying)  short  cuts  to  strategic  extrapolations;  they  have 
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cared  nothing  about  the  logic,  whether  of  the  economy  or  of  their 
methods.     Their  Economistry  is  the  crassest  of  alchemies. 

The  key  ideas  of  this  chapter  are  these:  Facts  never  speak  for 
themselves.  Every  method  of  looking  at  them,  every  technique  for 
analyzing  them  is  an  implicit  econometric  theory.  To  bring  out  the 
implicit  assumptions  for  a  critical  look,  we  shall  study  averages, 
trends,  indices,  and  other  very  common  methods  of  manipulating  data. 

I  do  not  mean  to  condemn  the  traditional  approaches  altogether. 
Certainly,  physiology  and  "mere"  description  can  do  no  harm — for 
ultimately  they  are  the  sources  of  hypotheses.  To  look  for  quick, 
cheap,  and  simple  short  cuts  to  forecasting  is  a  reasonable  research 
endeavor.  Furthermore,  modern  machines  can  help  by  doing  much  of 
the  dull  work,  provided  that  an  intelligent  being  is  available  to  study 
their  output. 

12.2.  The  time  interval 

Up  to  now  I  have  carefully  avoided  any  discussion  of  time.    In  the 

model  of  Chap.  1 

Ct  -  cc  +  yZt  +  ut  (12-1) 

what  does  t  —  1,  2,  .  .  .  ,  T  represent,  and  why  not  select  different 
intervals? 

The  secret  is  that  the  time  interval  t,  the  parameters  a  and  y,  the 
variables  C  and  Z,  and  the  stochastic  term  u  must  be  defined  not 
without  thought  of  but  with  regard  to  one  another.  If  the  time 
interval  is  short,  then  y  must  be  the  short-run  marginal  propensity  to 
consume.  If  t  is  a  year,  then  it  makes  sense  for  Z  to  be  treated  as 
predetermined.  As  the  time  interval  is  shortened,  more  and  more 
variables  change  from  predetermined  to  simultaneously  determined. 
With  shorter  and  shorter  time  periods,  the  causes  that  generate  the 
random  terms  overlap  more  and  more  and  invalidate  the  assumption  of 
serially  independent  random  disturbances. 

In  certain  cases  we  deliberately  reduce  the  number  of  time  Intervals 
of  our  data  in  order  to  bring  time  into  agreement  with  the  parameters 
and  stochastic  assumptions.  For  example,  if  we  are  trying  to  estimate 
a  production  or  cost  function  and  have  hourly  data  for  inputs  and 
outputs,  we  may  lump  these  into  whole  working  days;  otherwise  the 
disturbances  during  the  morning  warm-up  period,  coffee  break,  lunch 
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time,  and  the  Various  peak  fatigue  intervals  are  not  drawn  from  the 
same  Urn  of  Nature. 

The  smoothing  of  time  series  must  be  done  with  care.  In  the  above 
example,  if  the  purpose  is  to  make  the  random  disturbance  come  from 
the  same  Urn  in  each  interval,  then  overlapping  as  well  as  nonover- 
lapping  workdays  will  do.  If  we  also  want  the  disturbances  to  be 
serially  independent,  then  only  nonoverlapping  days  should  be  used. 

Digression  on-  moving  averages  and  sums 

Moving  averages  differ  from  moving  sums  only  by  a  constant 
factor  P  equal  to  the  number  of  original  intervals  smoothed 
together. 

If  P  is  even  ( =  2N)  the  average  or  sum  should  be  centered 
on  the  boundary  between  intervals  N  and  N  +  1.  If  2N  -+-  1 
intervals  are  averaged,  center  on  the  (N  -f*  l)st. 

There  are  many  smoothing  methods  besides  the  unweighted 
moving  average.  We  are  free  to  decide  on  the  span  P  of  the 
moving  average  and  on  the  weight  to  be  given  each  position  within 
the  span.  Given  P  successive  points,  we  may  wish  to  fit  to 
them  a  least  squares  quadratic,  logistic,  or  other  curve.  Every 
particular  curve  implies  a  particular  set  of  weights,  and  con- 
versely. Fitting  a  polynomial  of  degree  Q  through  P  points  can 
be  approximated  by  taking  the  simple  moving  average  of  a  simple 
moving  average  of  a  simple  moving  average  .  .  .  enough  times 
and  with  mi  table  spans. 

All  this  is  straightforward  and  rather  dull,  unaccompanied 
by  theoretical  justification.  What  makes  moving  averages 
interesting  is  the  claim  that  they  can  be  used  to  determine  and 
remove  the  trend  of  a  time  series.  We  shall  see  in  Sec.  12.8  how 
dangerous  a  technique  this  is.  As  we  shall  see  in  Sec.  12.5,  mov- 
ing averages  give  rise  to  broad  oscillations  where  none  exist  in 
the  original  series. 

12.3.  Treatment  of  serial  correlation 

The  term  serial  con-elation,  or  autocorrelation ,  means  the  noninde- 
pendence  of  the  values  ut  and  ut-e  of  the  random  terms.  The  term 
autorcgrcssion  applies  to  values  avand  xt-$  when  cov  (xt)xt-o)  ^  0. 
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In  this  section  we  consider  briefly  (1)  the  sources  of  serial  correlation, 
(2)  its  detection,  (3)  the  allowances  and  modifications,  if  any,  that  it 
occasions  in  our  estimating  techniques,  (4)  the  consequences  of  not 
making  these  allowances  and  modifications. 

Random  terms  are  serially  correlated  when  the  time  interval  l  is 
too  short,  when  overlapping  observations  are  used,  and  when  the  data 
from  which  we  estimate  were  constructed  by  interpolation.  Thus, 
if,  in 

Ct  -  a  +  yZt  +  ut 

t  measures  months  or  weeks,  then  the  random  term  has  to  absorb  the 
effects  of  the  months'  being  different  in  length,  weather,  and  holidays, 
effects  which  are  not  random  in  the  short  period  but  which  follow 
a  cycle  of  365  days.  If,  however,  t  is  measured  in  years,  then  all  these 
influences  are  equalized,  one  year  with  another,  and  ut  loses  some  of 
its  autocorrelation.  Similarly,  if  successive  sample  points  are  dated 
"January  to  December/'  "February  to  January,"  "March  to  Febru- 
ary," and  so  on,  successive  random  terms  are  correlated  at  least  10/12 
(10  being  the  number  of  months  common  to  successive  samples). 

Frequently  the  raw  materials  of  econometric  estimation  are  con- 
structed partly  by  interpolation.  For  instance,  there  is  a  ceniui  in 
1950  and  in  I960.  Annual  sample  surveys  in  1951,  1952,  .  .  .  meas- 
ure births,  deaths,  and  migrations;  these  data,  cumulated  from  1950, 
should  square  with  the  census  population  figure  of  1900.  Slnej  this 
seldom  happens,  the  discrepancy  in  the  final  published  figures  tl 
apportioned  (in  general,  equally)  among  the  several  years  of  thy  ii@id@s 
The  resulting  annual  figures  for  birth  rate,  etc.,  share  equal  portions 
of  a  certain  error  of  measurement  and  are,  therefore,  correlated  more 
than  they  otherwise  would  be.  In  a  model  that  uses  annual  data  on 
the  birth  rate  and  assumes  that  it  is  measured  without  error,  it  is  the 
random  term  that  absorbs  the  year-to-year  correlation. 

We  shall  illustrate  with  the  simple  model  (12-1).  There  are 
two  ways  to  detect  serial  correlation.  One  is  to  maintain  the  null 
hypothesis  that  none  exists: 

cov  (uhUt-$)  -  0  (12-2) 

estimate  the  model  on  this  assumption,  and  then  check  whether 
m(t/,)(ti,-*;  is  near  zero.     The  other  way  is  to  maintain  that  the  random 
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disturbances  do  have  a  serial  connection,  such  as 

ut  =  f  iw«-i  +•"••+  teUt-e  +  vt  (12-3) 

(with  vt  random  and  nonautocorrelated),  and  estimate  the  fs  to  see 
whether  they  are  significantly  different  from  zero. 

The  first  method  is  arithmetically  easier,  though  a  less  powerful  test. 
This  requires  explanation.  The  likelihood  function  of  our  sample  is 
the  same  as  (2-4) : 

L  -  (2tt)-5/2  det  (dttU)-*  exp  [-Mu(*uu)-lu] 

where  u  stands  for  th(3  successive  random  disturbances  (ui,W2,  .  .  .  ,Us). 
In  minimising  L  by  a  and  7,  we  should  get  the  greatest  efficiency  in 
&  and  $  if  we  took  account  of  the  fact  that  d  is  no  longer  diagonal 
when  there  is  serial  correlation  among  the  disturbances.  The  null 
hypothesis  cov  (ut)u>-e)  =  0,  though  it  does  not  bias  61  or  1  or  make 
them  inconsistent,  does  nevertheless  increase  their  sampling  variances 
and  covariances.  The  #'s  are  computed  with  the  help  of  the  inefficient 
&  and  i  and  are  themselves  inefficient  estimates  of  the  true  dis- 
turbances. Therefore  w^xa..*)  is  an  inefficient  (i.e.,  overspread) 
estimator  of  cov  (ut,vt-$)  and  provides  a  flabby  test  of  serial  correlation. 
It  does  not  reject  the  null  hypothesis  with  so  much  confidence  as  a 
more  powerful  test  (i.e.,  one  associated  with  a  very  pinched  distribution 

W(tf.)<«._0))- 

Instead  of  testing  by  m^,)^,_B)  it  is  recommended  that  we  compute 
the  expression 

■which  happens  to  have  convenient  properties,  which  are  of  no  concern 
to  the  present  discussion.  It  is  easily  seen  that,  if  m^^^g)  =  0,  then 
D(0)  =  2.  Large  departures  from  this  value  indicate  that  the  null 
hypothesis  is  untrue. 

The  second  method  for  taking  into  account  the  serial  correlation  of 
the  random  disturbance  is  more  efficient  than  the  first,  but  biased. 
To  see  this,  consider  the  special  case 

Ct^a  +  yZt  +  Ut  (12-4) 

ut  =  fu,-i  +  vt  (12-5) 
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As  good  simultaneous-approach  proponents,  we  combine  the  two 
equations  as  follows: 

Ct  -  fC«-i  -  a  +  y(Zt  -  fZ«..i)  +  Vt  (12-6) 

and  maximize  the  joint  likelihood  of  the  random  disturbances  v  with 
respect  to  the  three  parameters  a,  ?,  f .  Unfortunately,  not  only  does 
this  lead  to  a  high-order  system  of  equations,  but  the  maximum  likeli- 
hood estimates  are  biased.  The  reason  for  bias  is  the  sam§  as  in 
Chap.  3,  namely,  that  (12-6)  is  a  model  of  decay. 

There  is  yet  a  third  method,  which  is  somewhat  biased  and  somewhat 
inefficient.  First  fit  (12-4)  by  least  squares,  ignoring  (12-5):  this  step 
is  inefficient.  Then  compute  f  from  (12-5)  using  the  residuals 
H  of  the  previous  step:  this  introduces  the  bias.  Next  construct 
the  new  variables  ct  =  Ct  —  fC*_i,  zt  ~  Zt  —  l%t-\  and  fit  by  least 
squares  ct  =  a  +  yzt  +  wt  to  get  a  new  approximation  to  a  and  7. 
Repeat  the  cycle  any  number  of  tfe\?s. 

When  several  equations  have  autocorrelated  error  terms,  this  biased 
second  method  always  works  in  principle.  The  first  and  third  methods 
are  dangerous  to  use  because  we  know  practically  nothing  about  how 
good  Sm^t_e/(S  —  6)mn  is  as  an  estimator  of  the  regression  coeffi- 
cient of  ut  on  ut-e',  nor  do  we  know  whether  the  cyclical  procedure  of  the 
third  method  converges. 

Matters  get  rapidly  worse  the  more  complicated  the  dependence  of 
ut  on  its  past  values. 

12.4.  Linear  systems 

Most  business  cycle  analysis  proceeds  on  the  assumption  (sometimes 
explicitly  stated,  more  often  not)  that  an  economic  time  series  x(t) 
is  made  up  of  two  or  more  additive  components  f(t),  g(t)f  .  .  .  called 
the  "  trend,"  the  "cycle,"  the  " seasonal, "  and  the  " irregular."  Trend, 
cycle,  and  seasonal  are  supposed  to  be,  in  some  relevant  sense,  rather 
stable  functions  of  time;  the  irregular  is  not.  We  shall  use  the  expres- 
sions "irregular,"  "random  component,"  "error,"  and  "disturbance" 
interchangeably.  The  word  "additive"  signifies,  as  usual,  lack  of 
interaction  effects  among  the  components.1 

In  analyzing  time  series,  the  problem  is  to  allocate  the  observed 

1  See  Sec.  1.11. 
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fluctuations  in  x  to  its  unknown  additive  components: 

*(<)  -  /(0  +  9(f)  +  *W  +  Wf  (12-7) 

and  to  find  the  shapes  of/,  <7,  and  Jr.  whether  they  are  straight  lines, 
polynomial  or  trigonometric  functions,  or  other  complicated  forms. 

As  stated,  the  problem  is  indeterminate.  The  facts  will  never 
tell  us  either  how  many  additive  terms  the  expression  in  (12-7) 
should  have  or  what  shapes  are  best.  As  usual,  we  must  maintain  a 
hypothesis — that  the  trend  is,  say,  a  straight  line: 

that  the  cycle  is  some  trigonometric  function,  e.g., 

7  sin  (5  -h  et) 

and  so  forth,  the  problem  being  to  estimate  the  Greek  letters  from  data 
or  to  see  how  well  a  given  formulation  fits  in  comparison  with  some 
rival  hypothesis. 

Trigonometric  functions  can  be  approximated  by  lagged  expressions, 
such  as 

0(0  -  7o  +  yix(t  - 1)  +  y2x(t  -  2)  +  •  •  •  +  yQx(t  -  Q)  +  t*    (12-8) 

with  appropriate  coefficients.  The  term  "linear"  expresses  the  addi- 
tivity  of  the  components  of  (12-7)  or  the  linear  approximation  of 
(12-8)  or  both.  In  this  section  and  in  several  more,  we  shall  consider 
linear  systems  of  a  single  variable  x(t).  Linearity  in  the  second  sense 
(above)  is  very  handy,  because  in  linear  systems  the  number  of  lags 
in  (12-8)  and  the  values  of  the  yB  determine  whether  g(t)  oscillates, 
explodes,  or  damps;  the  initial  value  g(0)  determines  only  the  amplitude 
of  the  fluctuations.  In  nonlinear  systems  amplitude  and  type  are  not 
separable  in  this  way. 

We  shall  devote  Sees.  12.5  to  12.7  to  a  priori  trendless  systems;  then, 
in  Sec.  12.8,  we  shall  inquire  how  we  know  a  system  to  be  trendless 
and,  if  it  has  a  trend,  how  this  trend  can  be  removed. 

12.5.  Fluctuations  in  trendless  time  series 

A  trendless  or  a  detrended  time  series  can  be  random,  oscillating,  or 
cyclical.     It  is  random  if  it  can  be  generated  by  independent  drawings 
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from  a  definable  Urn  of  Nature.  It  is  cyclical  (or  periodic)  if  it  repeats 
itself  perfectly  every  12  time  periods;  it  is  oscillatory  if  it  is  neither 
random  nor  periodic. 

A  simple  trigonometric  function  like  sin  (2irt/Q)  or  sin  (2irt/Q)  -f  b 
sin  (2irt/ti)  is  strictly  cyclical.  The  combination  of  two  or  more  trigo- 
nometric functions  with  incommensurate1  periods  Qi,  fig,  .  .  .  [for 
instance,  x(t)  =  sin  (27r//fii)  +  «os  (27r//ft2)]  is  not  periodic  but  oscil- 
latory. Commensurate  periods  Qif  122,  .  .  .  appear  in  (12-7)  only  in 
the  trigonometric  terms  sin,  cos,  tan,  etc.,  and  not  as  multiplicative 
factors,  exponents,  etc. 

With  the  exception  of  purely  seasonal  phenomena  (which  are 
periodic),  economic  time  series  are  overwhelmingly  of  oscillating  type. 
Oscillations  arise  from  three  sources:  (1)  the  summation  of  non- 
stochastic  time  series  with  incommensurate  periods,  (2)  moving 
averages  of  random  series,  and.  (3)  autoregressive  systems  having  a 
stochastic  component. 

We  can  briefly  dispose  of  the  last  case  first.  If  x(t)  is  an  autogressive 
variable 

x(t)  -  aix(t  -  1)  +  '.  .  .  +  aHx(t  -  H)  +u<  (12-9) 

whose  systematic  part  would  damp  if  u  were  to  be  continually  zero, 
then  x(t)  can  be  expressed  as  a  weighted  moving  average  of  the  random 
disturbances,  and  so  the  third  case  reduces  to  the  second  case  above.2 
The  moving  average  of  a  raidom  series,  however,  oscillates!  This 
proposition,  the  Slutsky  proposition,  shocks  the  intuition  at  first  and, 
therefore,  deserves  some  discussion.  Let  us  take  a  time  series  so  long 
that  we  do  not  have  to  worry  about  any  shortage  of  material  to  be 
averaged  by  moving  averages.  Consider  now  a  moving  average 
spanning  P  of  the  original  periods.  To  facilitate  the  exposition,  let 
us  take  P  amply  large.  Now  the  original  series  u(t)y  if  it  is  random, 
should  itself  be  neither  constant  nor  periodic.  Because  if  it  is  constant, 
it  is  not  random.  And  if  it  is  periodic,  a  given  value  of  u  depends  on 
the  previous  one;  hence  u{t)  is  not  random.  A  truly  random  series  is 
neither  full  of  runs  and  patterns  nor  entirely  bereft  of  them.  Just  as  a 
true  die,  once  in  a  while,  produces  runs  of  sixes  or  aces,  so  a  random 

*Two  real  numbers  are  incommensurate  when  their  ratio  is  not  a  rational 
number. 

2  See  Kendall,  vol.  2,  pp.  406-407. 
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time  series  occasionally  exhibits  a  run.  For  the  sake  of  illustration, 
suppose  the  run  is  3  periods  long  and  Wioi  =  W102  =  Wios  =  10.  Now 
consider  what  happens  to  its  moving  average  in  the  neighborhood  of  the 
run.  Let  the  span  be  large  relative  to  the  run,  say,  P  =  17.  Then 
the  moving  average  has  a  run  (less  pronounced  and  more  tapered) 
19  periods  long — that  is  to  say,  from  the  time  that  the  right-hand  end 
of  the  span  includes  U101  to  the  time  that  its  left-hand  end  includes  wi03. 
A  moving  average  of  a  moving  average  of  a  random  series  oscillates 
even  more. 

These  simple  properties  are  vital  for  the  statistical  analysis  of 
business  cycles. 

In  the  first  place,  the  economic  system  itself  operates  somewhat  like 
a  moving  average  of  random  shocks:  consumers,  businesses,  govern- 
ments get  buffeted  around  by  random  external  and  internal  impulses, 
such  as  weather,  a  rush  of  orders,  a  rash  of  tax  arrears;  the  economy 
takes  most  of  these  things  in  its  stride;  it  does  not  adjust  instantane- 
ously and  completely  to  the  shocks,  but  rattier  cushions  and  absorbs 
them  over  considerably  larger  spans  than  their  original  duration. 
The  Slutsky  proposition  accounts  for  business  oscillations  as  the  result 
of  averaging  random  shocks. 

In  the  second  place,  even  if  the  economic  system  itself  does  no 
averaging,  statisticians  do.  The  national  income,  price  indexes,  and 
other  data  in  all  the  fact  books  are  averages  or  cumulants  of  one  sort  or 
another,  frequently  over  time.  Such  data  would  exhibit  oscillations 
even  if  the  economy  itself  did  not. 

Finally,  analysts  who  use  the  moving  average  technique  (on  other- 
wise flawless  data  from  an  economy  that  is  innocent  of  averaging) 
cither  for  detrending  or  for  any  other  purpose  may  themselves  intro- 
duce oscillations  into  their  charts  and  so  generate  a  business  cycle 
where  none  exists. 

12.6,  Correlograms  and  kindred  charts 

According  to  the  Slutsky  proposition,  if  we  want  to  analyze  a  time 
series  we  shall  be  well  advised  to  leave  it  unsmoothed  and  try  some 
direct  attack. 

It  is  natural  to  ask  first  whether  a  given  trendless  time  series  \(t)  is 
oscillating  or  periodic.     In  the  nonstochastic  case  the  question  can  be 
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quickly  settled  by  the  unaided  eye,  detecting  faithful  repetition  of  a 
pattern,  however  complicated.  In  the  stochastic  cases  the  faithful 
repetition  is  obscured  by  the  superimposed  random  effects  a:;fi  Iheir 
echoes,  if  any. 

Define  serial  correlation  of  order  6  as  the  quantity 

r0\  =       cov  fa,xt-e) 


(yarxtvarxt-o)** 


A  correlogram  is  a  chart  with  0  on  the  horizontal  axis  and  p(0)  or  its 
estimate  r(0)  on  the  vertical.  A  strictly  periodic  time  series  has  a 
periodic  correlogram  with  always  the  same  silhouette  and  the  same 
periodicity.  If  the  former  is  damped,  so  is  the  latter.  A  moving 
average  of  random  terms  has  a  damped  (or  damped  oscillating)  cor- 
relogram of  no  fixed  periodicity.  A  nonexplosive  stochastic  auto- 
regressive  system  like  (12-9)  is  a  damped  wave  of  constant  periodicity. 

Correlograms  are  not  foolproof.  They  may  or  may  not  identify 
correctly  the  type  of  model  to  which  a  given  time  series  belongs.  For 
instance,  if  the  random  term  in  (12-9)  is  relatively  large,  the  correlogram 
of  x(t)  will  compromise  between  the  strictly  periodic  silhouette  of  the 
exact  autoregressive  system  <xix(t  —  !)+•••+  <xnx(t  —  H)  and  the 
nonperiodic  silhouette  of  the  cumulated  random  terms  ur  -f  ciun-i  + 
•  •  •  -f-  aH~lu\.  In  general,  it  will  neither  damp  progressively  nor 
exhibit  any  fixed  periodicity.  This  is  very  unfortunate,  because,  from 
a  priori  theory,  we  expect  to  meet  such  time  series  often  in  economics. 

Businoss  cycle  and  stock  market  analysts  are  often  interested  in  turn- 
ing points  in  a  series  and  in  forces  bringing  about  these  turning  points 
rather  than  in  the  amplitude  of  the  fluctuations.  This  leads  naturally 
to  periodograms.  To  take  an  example  from  astronomy,  imagine  that 
the  time  series  \(t)  measures  the  angle  of  Mars  and  Jupiter  with  an 
observer  on  earth.  We  know  this  series  to  be  analyzable  into  four 
components:  the  revolutions  of  Earth,  Mars,  and  Jupiter  round  the 
sun  plus  the  minor  factor  of  the  earth's  daily  rotation.  Periodograms 
are  supposed  to  show,  from  evidence  in  the  time  series  itself,  the  four 
relevant  periods  fii  =  3G5.26  days,  fi2  =  687  days,  Q3  =  H-86  years, 
and  ft4  =  24  hours.  This  is  a  relatively  easy  matter  if  the  series  is 
nonstochastic,  if  we  know  beforehand  that  only  four  basic  periods  are 
involved,  or  both.  The  composite  series  fluctuates  and  undergoes 
accelerations,  decelerations,  and  reversals  occasioned  by  the  move- 
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mcnts  of  its  four  basic  components.    All  this  is  captured  by  the 
formulas 

2irs 


A  «  y  X(8)  cos  -—- 


where  12  is  an  unknown  period.  The  periodogram  is  a  chart  with  12 
on  the  horizontal  and  S2  on  the  vertical  axis.  The  value  S2  —  A2  +  B2 
attains  maxima  when  12  takes  on  the  values  Qi,  122,  123,  124.  The  tech- 
nique works  fairly  well  if  x(/)  is  indeed  composed  of  periodic  (trigo- 
nometric) terms  and  a  random  component.  It  works  very  badly 
when  x(0  is  autoregressive,  because  the  echoes  of  past  random  dis- 
turbances are  of  the  same  order  of  magnitude  as  the  smaller  periodic 
components  of  x(t)  =  ct\x(t  —  1)  -f-  •  •  •  -f  aux{t  —  H)  and  claim  the 
same  attention  as  the  latter  in  the  formula  for  S2.  Like  the  cor- 
relogram,  the  periodogram  fails  us  where  it  is  most  needed,  that  is, 
in  the  analysis  of  an  economic  time  series  which  we  know  to  be  auto- 
regressive and  stochastic  though  we  know  nothing  about  the  number 
and  size  of  its  12s. 

12.7.  Seasonal  variation 

The  easiest  periodic  components  to  measure  and  allow  for  are  those 
tied  to  astronomy.  We  know  that  the  cycle  of  rain  and  shine  repeats 
itself  every  365  days,  and  we  would  naturally  expect  this  to  be  reflected 
in  any  time  series  having  to  do  with  swim  suits,  umbrellas,  or  number 
of  eggs  laid  by  the  average  hen.  The  same  is  true  of  cycles  imposed  by 
custom  or  by  the  state,  for  instance,  the  seven-day  recurrence  of 
Sunday  idleness,  the  Christmas  rush,  the  preference  of  employees  for 
July  holidays.  In  all  these  cases  the  period  itself  is  known,  although 
it  may  be  complicated  by  moving  feasts,  the  varying  number  of  days 
in  a  month,  and  the  occasional  occurrence  of,  say,  a  short  month 
containing  four  Sundays  plus  Easter  or  a  Friday  the  thirteenth.  The 
problem  here  is  not  to  find  the  seasonal  period  but  its  profile. 

It  is  one  thing  to  recognize  and  measure  the  seasonal  profile  and 
another  to  remove  it.  Sometimes  we  want  to  do  the  former,  some- 
times the  latter,  depending  on  our  purpose. 
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If  the  purpose  is  to  forecast  cycles  and  trends,  it  is  a  false  axiom  that 
a  seasonally  adjusted  series  is  a  better  series.  The  only  time  we  are 
justified  in  taking  out  seasonal  fluctuations  is  when  we  believe  that 
businessmen  know  there  is  seasonality,  expect  it,  and  adjust  to  it  in 
a  routine  way,  either  consciously  in  a  microeconomic  way  or  in  their 
totality  when  many  millions  of  their  microeconomic  decisions  interact 
to  form  the  business  climate.  So,  for  forecasting  purposes,  it  is 
legitimate  to  wash  out  seasonal  movements  only  when  they  are  washed 
out  of  the  calculations  of  consumers  and  businessmen.  If  a  seasonal 
exists  but  people  have  not  detected  it,  it  should  be  left  in.  For 
instance,  if  it  were  true  that  the  stock  market  had  seasonal  properties 
unknown  to  its  traders,  they  should  not  be  corrected  for,  because  the 
participants  mistake  these  for  basic  trends  and  react  accordingly. 
Conversely  if  the  relevant  people  think  there  is  a  seasonal  when  in 
fact  none  exists,  its  imagined  effect  should  be  allowed  for  by  the  fore- 
caster of  trends.  Suppose,  as  an  example,  that  the  market  believes 
that  the  U.S.  dollar  falls  in  the  summer  relative  to  the  Canadian  and 
rises  in  the  winter.  This  imagined  seasonal  should  be  taken  'into 
account  in  analyzing  the  significance  of  monthly  or  quarterly  import 
orders.  To  deseasonalize  every  time  series  may  increase  knowledge  in 
all  cases,  but  it  increases  forecasting  accuracy  only  when  the  time  has 
come  when  the  market  has  learned  all  the  real  seasonals  and  imagines 
none  where  none  exist. 

Every  formula  either  for  measuring  seasonals  or  for  removing  them 
is  an  implicit  economic  theory,  which  may  be  appropriate  for  one 
economic  time  series  and  inappropriate  for  another.  For  Instance, 
treating  the  seasonal  as  an  additive  factor  implies  that  a  given  absolute 
deviation  from  some  normal  or  trend  is  equally  important  In  all 
months.  This  is  false  in  the  case  of,  say,  housing-construction  starts 
in  Labrador;  the  average  number  of  these  is,  let  us  assume,  4  in 
December  and  50  in  July.  Then  5  starts  in  December  is  a  more 
serious  departure  than  51  in  July.  However,  many  analysis  use 
additive  seasonals  for  each  and  every  time  series. 

If  the  genuine  seasonal  period  is  12  months,  its  profile  can  be  approxi- 
mated by  averaging  the  scores  of  several  Januaries,  then  several 
Februaries,  etc.  This  technique  gives  a  biased  estimate  of  the  seasonal 
profile  if  the  time  series  is  autoregressive,  unless  random  disturbances 
12  months  apart  are. independent.     To  see  this,  take  (for  simplicity 
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only)  the  one-lag  autorcgressive  model 

x(i)  =  ax(t  —  1)  +  7  sin  -^  +  ut 

and  let  0  represent  the  first  January  and  12  the  following  one.  For  sim- 
plicity, let  us  average  the  values  x(0)  and  a: (12)  of  just  Januaries. 
Then  we  have 

z(12)  =  a12x(0)  +  al2Ui  +  allu2  +•••■+  otUn  +  W12  +  7  sin  2ir 

which  involves  a  moving  sum  of  random  terms,  and  this  sum  oscillates, 
as  we  already  know  from  Sec.  12.5.  The  oscillation  due  to  the  random 
term  will  be  confounded  with  the  amplitude  of  the  true  seasonal. 
This  will  manifest  itself  in  two  ways:  either  the  seasonal  will  seem  to 
shift  or,  if  it  does  not  shift,  it  will  contain  the  cyclical  properties  of  the 
cumulated  random  effects. 

12.8.  Removing  the  trend 

Ultimately,  economic  theory  and  not  the  facts  tell  us  whether  the 
trend  (or  longest-term  movement)  is  linear  or  otherwise.  If  we  obtain 
the  trend  as  what  is  left  after  cycles  and  seasonals  have  been  taken  out, 
the  trend  inherits  all  the  diseases  and  pitfalls  of  the  seasonals. 

In  particular,  if  we  use  a  moving  average  to  obtain  the  trend,  we  are 
almost  certain  to  get  it  wrong.  To  see  this,  suppose  that  we  have  a 
trendless  cyclical  and  stochastic  phenomenon,  say 

xt  =  sin  —  -f  ut 

depicted  in  Fig.  24.  If  the  span  P  of  the  moving  average  is  longer 
than  the  true  period  0,  then  the  moving  average  (dashes  in  Fig.  24) 
exaggerates  the  oscillations  and  imposes  a  long  wavy  trend  where  none 
existed.     Or  again,  if  the  system  is  autoregressive  and  trendless, 

s(0  -  <*ix(t  -  1)  ■+  •  •  •  +  otnx(t  -  H)+ut 

the  moving  average  of  the  random  term  contributes  its  oscillations  to 
the  systematic  ones  and,  by  the  same  process  as  that  shown  in  Fig.  24, 
imposes  a  long,  wavy  trend.     Naturally,  distortions  like  these  arise 
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when  x(t)  truly  contains  some  systematic  trend.    Moving  averages 
distort  both  the  trend  and  the  cycles. 

The  variate  difference  method  eliminates  trends  on  the  ground  that 
any  trend  can  be  approximated  by  a  polynomial  of  some  degree  N 
and  that  such  a  polynomial  can  be  brought  down  to  zero  after  N  +  1 
differentiation??.     Therefore,  let 

x{t)  -  7o  +  yit+  •  •  •  +  yst"  +  f(t)  +  ut  (1240) 

where  f(t)  and  ut  are  the  cyclical  and  random  factors.    The  method 


Fig.  24 

proceeds  as  follows: 

1.  Difference  (12-10)  once: 

x(t  -  1)  -  70  +  7i(*  -  1)  +  •  •  ' 

+  y*tN~l  +  f(t  -  l)  +  w<-t   <is-ii) 

2.  Subtract  (12-11)  from  (12-10)  and  call  y(t)  the  new  variable 
x(t)  —  x(t  —  1).  We  do  not  need  to  write  out  y(t)  in  full  Ml  fiote 
only  that  its  trend  is  a  polynomial  of  one  degree  less  than  th§  poly- 
nomial in  (12-10)  and  that  its  random  component  is 

vt  =  ut  -  ut-i  (tS-12) 

3.  Do  the  same  for  y(t),  and  define  z(t)  =  y(t)  —  y(t  —  1);  this  too 
reduces  the  power  of  the  trend  and  generates  a  random  component 


wt 


vt  —  vt-i  =  ut  —  2ut-\  +  Ut-2 


(12-13) 


4.  Continue  in  this  fashion  as  long  as  the  estimated  covafiances 
mXX)  rriyy/2,  m„/6  decrease.  (The  correcting  denominators  are  dis- 
cussed below.) 

To  see  what  is  going  on,  consider  the  first  quadrant  of  Fig.  25,  whore 
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x(t)  was  taken  to  be  a  second-degree  polynomial  of  L  Then  y(t)  is  a 
sloping  straight  line,  and  z(t)  is  a  level  one.  The  variance  of  x(t)  is 
quite  high,  because  x  assumes  many  widely  different  values  as  t  changes. 
The  variance  of  y(t)  is  smaller,  because,  though  y  varies,  it  varies  more 
smoothly  than  x.  And  z  does  not  vary  at  all.  The  variate  difference 
method  reduces  tho  trend  to  z,  and  any  remaining  variation  in  the 
resulting  scries  must  be  due  to  nontrend  components. 

Several  things  are  wrong  with  this  method.  First,  if  x  extends  to 
the  second  quadrant  of  Fig.  25,  say,  symmetrically,  its  covariation 
with  its  lagged  values  may  be  very  small  or  even  zero.     And,  in  general, 


Fig.  25 


a  high-degree  polynomial,  because  it  twists  and  turns  up  and  down, 
may  exhibit  a  smaller  lag  covariance  than  a  low-degree  polynomial. 
Hence  we  should  faithfully  carry  on  successive  differencing  in  spite  of 
a  drop  in  the  series  mxx,  mvv,  mM.  But  suppose  we  do.  How  are  we  to 
tell  when  the  polynomial  of  unknown  degree  has  finally  died  down? 
For  meanwhile,  as  (12-12)  and  (12-13)  show,  we  are  performing 
moving  averages  of  the  cyclical  component  and,  for  all  we  know,  this 
component  may  increase  or  decrease.  Finally,  the  variate  difference 
method  cannot  come  to  any  stop  if  its  cyclical  component  has  a  short 
lag.  For  instance,  the  first  differences  of  1,  —  1,  1,  —  1,  .  .  .  are 
2,  —2,  2,  —2,  .  .  .  ,  and  tho  first  differences  of  the  latter  are  4,  —4, 
4,  —4,  and  so  on. 

Now  a  word  about  the  correcting  denominators.     If  ut  itself  is  serially 
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uncorrelated,  then,  from  (12-12),  the  variance  of  vt  is  twice  that  of  utf 

since 

var  vt  =  cov  (utut)  -  2  cov  (w<,W/-i)  +  cov  (u,_i,u«_i) 

=  cov  (uhut)  H-  cov  (ut-i,ut-.i)  ■  2  COV  (f#|t«i) 
Similarly, 

var  wt  =  var  (w*  -  2w<„i  +  W/-2) 

=  var  ut  +  2  var  w<_i  +  var  w<_j  =  0  var  t*i 

and  so  on  for  higher-order  differences. 

In  my  opinion,  all  these  methods  for  detecting  or  eliminating  the 
trend  have  serious  imperfections.  The  way  out  is,  as  usual,  to  specify 
the  algebraic  form  of  the  trend,  the  number  of  cyclical  components 
acting  on  it,  to  make  stochastic  assumptions,  and  to  maximize  the 
likelihood  of  the  sample.  The  procedure  is  very  laborious;  it  is 
generally  biased,  but  efficient.  I  think  it  represents  the  best  we  can 
ever  do,  and  I  am  condemning  the  other  methods  only  if  they  are 
pretentiously  paraded  as  scientific.  I  do  admit  them  as  approxima- 
tions to  the  ideal. 

12.9.  How  not  to  analyze  time  scries 

The  National  Bureau  of  Economic  Research  has  attracted  a  great 
deal  of  attention  with  its  large-scale  compilation  and  analysis  of 
business  cycle  data.  The  compilation  is  done  with  such  care,  tenacity, 
and  love  as  to  earn  the  gratitude  of  all  users  of  statistics.  The  analysis, 
however,  has  often  been  questioned.     It  proceeds  roughly  as  follows: 

1.  Define  a  reference  cycle  for  all  economic  activity.  This  is  a 
conglomerate  of  the  drift  of  several  time  series,  accorded  various 
degrees  of  importance. 

2.  Remove  seasonal  variations  from  the  given  series,  say,  carload- 
ings  or  business  failures. 

3.  Divide  the  given  series  into  bands  corresponding  to  the  reference 
cycle. 

4.  Within  each  band  express  each  January  reading  as  a  per  cent  of 
the  average  January  in  the  band,  and  so  on  to  December. 

5.  In  each  of  the  resulting  specific  cycles  recognize  nine  typical 
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positions  or  phases.  The  latter  may  be  widely  spaced,  like  an  open 
accordion,  in  a  long  specific  cycle  or  tightly  in  a  short  one.  The  result 
is  now  considered  to  be  the  business  cycle  in  carloadings,  and  constitutes 
the  raw  material  for  forecasting,  for  computing  the  irregular  effects, 
and  for  checking  whether  the  given  series  can  be  said  to  have  its  typical 
periodicity,  amplitude,  etc.  There  are  variations  of  the  procedure, 
some  ad  hoc.  After  what  I  said  earlier  in  the  chapter  about  the 
pitfalls  of  time  series,  I  shall  not  make  any  further  comment  on  the 
National  Bureau's  method.  Recently,  electronic  computations  have 
been  programmed,  mainly  for  removing  the  seasonal.1  As  they 
involve  the  use  of  several  layers  of  moving  averages,  they  are  not 
altogether  safe  in  the  hands  of  an  analyst  ungrounded  in  mathematical 
statistics;  since,  however,  the  seasonal  is  the  least  likely  to  cause  harm 
(after  all,  the  period  is  correct),  we  may  set  this  question  aside. 

12.10.   Several  variables  and  time  series 

In  Sees.  12.1  to  12.9  we  have  considered  variables  that  move  in 
time  subject  to  shocks  and  to  laws  of  motion  unconnected  with  any 
other  variables.  It  hardly  needs  stressing  that  endogenous  economic 
variables  are  not  of  this  kind,  since  all  of  them  are  generated  jointly  by 
the  workings  of  the  economic  system.  One  wonders  of  what  use  is  the 
analysis  of  individual  time  series  despite  the  heavy  apparatus  of 
correlograms,  pcriodograms,  and  variate  differences. 

Assuming  that  several  economic  variables  hang  together  structurally, 
what  kinds  of  time  series  do  they  manifest?  Sections  12.11  and  12.12 
discuss  this  problem.  If  several  economic  variables  are  unconnected, 
how  docs  a  given  combination  of  them  behave?  The  answer  to  this 
question  (Sec.  12.13)  provides  a  null  hypothesis  for  judging  the  effec- 
tiveness of  averages,  sums,  and  a  variety  of  business  indicators,  like 
the  National  Bureau  of  Economic  Research  "cyclical  indicators"  and 
"diffusion  indexes"  (Sec.  12.13).  The  converse  problems  are  also  of 
great  importance  to  the  progress  of  business  cycle  research,  because 
consideration  of  individual  time  scries  may  enable  us  to  infer  the  nature 
of  the  economic  system  without  laboriously  estimating  each  structural 
equation  by  the  methods  of  Chaps.  1  to  9. 

1  See  Julius  Shiskin,  Electronic  Computers  and  Business  Indicators  (Occasional 
Paper  57,  New  York:  National  Bureau  of  Economic  Research,  1957). 
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12.11.  Time  series  generated  by  structural  models 

What  kinds  of  time  series  are  generated  when  the  two  variables  x 
and  y  are  structurally  related?  We  shall  take  up  this  question  first  for 
nonstochastic  relations  and  then  for  stochastic  relations  under  various 
simplifying  assumptions.     All  our  models  will  be  complete. 

If  the  model  is  completely  nonlagged,  like  the  usual  skeleton  business 
cycle  model 

r.cif         <"-"> 

with  investment  taken  as  exogenous,  then  the  time  series  for  consump- 
tion and  income  have  the  same  shape  as  the  series  for  investment,  as 
can  be  seen  from  the  reduced  form: 

(1  -  p)C  -  a  +  pi 

(i-«r-«  +  / 

In  this  example  the  agreement  is  not  only  in  the  timing  of  turns  but 
in  the  phase  as  well,  because  investment,  consumption,  and  income  are 
positively  related.     In  a  more  extended  model 

C  =  a  +  pY 

I  «  y  +  &  (12-15) 

Y « C+/+G 

where  investment  is  endogenous,  government  expenditure  is  exogenous, 
and  investment  is  discouraged  by  the  latter  (5  >  0),  all  time  series  will    $ 
coincide  on  timing;  but  when  G  grows  /  falls,  and  C  and  Y  will  fall  if  5 
is  less  than  —  1  and  will  rise  if  it  is  greater. 

If  (12-14)  and  (12-15)  are  made  stochastic,  all  endogenous  variables 
absorb  some  of  the  random  disturbances.  The  random  disturbances 
apportion  themselves,  one  year  with  another,  according  to  a  fixed 
pattern  among  the  endogenous  variables.  For  instance,  if  u  is  the 
random  disturbance  of  the  consumption  function  and  v  of  the  invest- 
ment function,  the  reduced  form  of  (12-15): 

(1  -  p)C  =  (a  +  py)  +  ((3  +  p8)G  +  U+     $V 

(1  -  P)I  =  (t  ~  Py)  +  («  -  P8)G  +  (1 -  P)v       (12-16) 

(1  -  p)Y  =     (a  +  7)   +   (l  +  W  +  «     +    v 
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shows  that  the  fluctuations  and  irregular  components  are  in  step, 
though  they  may  differ  in  their  amplitudes. 

The  variances  of  the  three  irregular  components  in  (12-16)  are 
proportional  respectively  to  <ruu  +  2/?<ruv  +  0V™,  (1  —  P)<rw,  and 
<ruu  +  2<rUv  +  (Tvv  Thus,  if  the  two  random  terms  are  positively 
correlated  (<ruv  >  0),  income  wobbles  more  than  consumption  and 
consumption  more  or  less  than  investment,  depending  on  the  size  of 
the  marginal  propensity  to  consume. 

Let  us  now  consider  as  a  recursive  model  the  market  for  fish.  The 
men  go  to  their  boats  with  today's  price  in  their  minds,  expecting  it  to 
prevail  tomorrow,  and  work  hard  if  the  price  is  high.  Thus  tomorrow's 
supply  depends  on  today's  price  plus  weather  (z).  Should  the  price 
fall,  the  fishermen  don't  put  the  fish  back  into  the  sea;  so  at  the  end  of 
the  day  all  the  fish  is  sold.     Demand  is  ruled  by  current  price  only. 

d  =  a  +  0p  +  u 

S   »   y  +   dpi  +  €2  +  V  (12-17) 

s  =  d 

The  model  can  be  solved  for  p  as  follows: 

a  +  fipt  -f  ut  =  7  +  Spt-i  +  €2<  +  vt 

which  shows  that  price  tends  to  zigzag  (0  negative,  8  positive),  falling 
with  good  weather  and  rising  with  bad,  as  we  might  expect.  In 
(12-17),  unlike  case  (12-10),  the  irregular  components  of  the  price  and 
quantity  time  series  are  no  longer  constant  multiples  of  each  other,  nor 
are  they  in  step.  This  is  so  because  randomly  overeager  demand 
{u  >  0)  affects  not  only  today's  price  but,  through  its  effect  on  the 
fishermen's  efforts,  contributes  to  a  fall  in  tomorrow's  price  as  well. 

The  connections  among  phase,  amplitude,  and  irregularity  in 
structurally  related  time  series  become  very  complicated  as  we  increase 
the  number  of  variables  and  as  we  admit  more  and  more  lags  and  cross 
lags.  In  any  representative  set  of  economic  time  series  it  would 
Indeed  be  a  marvel  if  closely  similar  patterns  emerged,  except  between 
such  series  as  sales  of  left  and  of  right  shoes.  And  yet  the  marvel 
scorns  to  happen. 
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12.12.  The  over-all  autoregression  of  the  economy 

Regardless  of  which  came  first,  chickens  and  eggs  in  the  long  run 
have  similar  time  scries,  because  there  can  bo  no  chicken  without  a 
previous  combination  of  egg  and  chicken  and  there  can  be  no  egg 
without  a  previous  chicken.  Since  the  hatching  capacity  of  a  hen  is 
fixed,  say,  10  chicks  per  hen,  and  since  the  chicken-producing  capacity 
of  an  egg  is  also  fixed,  say  1  to  1,  the  cycles  in  the  egg  population  and 
in  the  hen  population  cannot  possibly  fail  to  exhibit  a  likeness — though, 
in  particular  short-run  instances,  random  disturbances  like  Easter  or  a 
fox  can  grievously  misshape  now  the  one,  now  the  other  series.  Orcutt 
has  claimed1  that  something  like  this  is  true  of  the  time  series  of  the 
economy's  endogenous  variables.  He  states  that  the  autoregressive 
relation 

xt+i  =  l.Sxt  -  O.Sxt-i  +  ut+i 

fairly  describes  the  body  of  variables  used  by  Tinbergen  in  his  pioneer- 
ing analysis  of  American  business  fluctuations.2  Orcutt's  result,  if 
correct,  would  not  exactly  spell  the  end  of  structural  estimation  of 
econometric  models,  because  the  latter  may  bo  more  efficient,  less 
biased,  etc.  However,  if  a  correct  autoregression  were  discovered,  it 
would  certainly  short-circuit  a  good  deal  of  current  research. 

Orcutt's  theorem  holds  only  for  systems  whose  exact  part,  by  itself, 
is  stable  and  nonexplosive.  Orcutt  also  found  that  we  can  get  better 
estimates  of  the  over-all  autoregression  if  we  consider  many  time  series 
simultaneously  than  if  we  consider  them  one  at  a  time.  This  follows 
from  the  fact  that  Easter  and  foxes  descend  on  eggs  and  hens  inde- 
pendently, so  that  a  grievous  random  dent  in  the  egg  population  tends 
to  be  balanced  by  the  relative  regularity  of  the  hen  population. 

In  the  absence  of  random  shocks,  all  the  interdependent  variables 
have  the  same  periodicity  but  different  timing,  amplitudes,  and  levels 
about  which  they  fluctuate.  With  random  shocks,  the  periodicities 
are  destroyed  more  or  less  depending  on  the  severity  of  the  shocks  and 
their  incidence  on  particular  variables.     The  unaided  eye  can  seldom 

1  Reference  ia  in  Further  Readings  at  the  end  of  the  chapter. 

2  Jan  Tinbergen,  Statistical  Testing  of  Business-Cycle  Theories  (Geneva:  League 
of  Nations,  1939). 
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recognize  the  true  periodicity.  A  highly  sophisticated  technique  can 
screen  out  the  autoregressive  structure  by  combining  observations 
from  all  time  series,  but  it  is  so  difficult  to  compute  that  one  might  as 
well  specify  a  model  in  the  ordinary  way. 

Exerciser 

12.A  Let  g(t)  and  k(t)  be  the  population  of  gnus  and  of  kiwis. 
Let  fa  and  5»  be  the  age-specific  birth  and  death  rates  for  gnus  and  aj 
and  7/  for  kiwis.  Disregard  the  question  of  the  sexes.  Let  e  and  £ 
stand  for  input-output  coefficients  expressing  the  necessary  number 
of  kiwis  a  gnu  must  eat  to  survive,  and  conversely.  Construct  a 
model  of  this  ecological  system.  Do  something  analogous  for  new 
cars  and  used  cars. 

12. B  In  a  Catholic  region,  say,  Quebec,  the  greater  the  number 
of  priests  and  nuns,  other  things  being  equal,  the  smaller  the  birth 
rate,  because  the  clergy  is  celibate.  But  the  more  numerous  the 
clergy,  other  things  being  equal,  the  higher  the  birth  rate  of  the  laity, 
because  of  much  successful  preaching  against  birth  control.  Construct 
an  ecological  model  for  such  a  population. 

12. C  The  more  people,  the  more  lice,  because  lice  live  on  people. 
But  the  more  lice,  the  more  diseases  and,  hence,  the  fewer  people. 
Construct  the  model,  with  suitable  life  spans  for  the  average  louse 
and  human. 

12.D  According  to  the  beliefs  of  a  primitive  tribe,  lice  are  good  for 
one's  health,  because  they  can  be  observed  only  on  healthy  people. 
(Actually  the  lice  depart  from  the  sick  person  because  they  cannot 
stand  his  fever.)  Construct  this  model  and  compare  with  Exer- 
cise 12,0. 


12.13.  Leading  indicators 

An  economic  indicator  is  a  sensitive  messenger  or  representative  of 
other  economic  phenomena.  We  search  for  indicators  in  the  same 
spirit  in  which  pathology  examines  the  tongue  and  measures  the  pulse: 
for  quickness,  cheapness,  and  to  avoid  cutting  up  the  patient  to  find 
out  what  is  wrong  with  him. 

A  timing  indicator  is  a  time  series  that  typically  leads,  lags,  or 
coincides  with  the  business  cycle.  Exactly  what  this  means  will 
occupy  us  later.     We  shall  deal  only  with  the  leading  indicators. 
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From  what  was  said  in  Sec.  12.12,  it  comes  as  no  surprise  that  certain 
economic  time  series,  like  Residential  Building  Contracts  Awarded, 
New  Orders  for  Durable  Goods,  and  Average  Weekly  Hours  in  Manu- 
facturing, should  have  a  lead  over  Disposable  Income,  the  Consumer 
Price  Index,  and  so  forth.  The  difficult  questions  are  (1)  how  to 
insulate  the  cyclical  components  of  each  series  from  the  trend,  seasonal, 
and  irregular;  (2)  how  to  tell  whether  leads  in  the  sample  period  are 
genuine  rather  than  the  cumulation  of  random  shocks;  and  (3)  where 
phases  are  far  apart,  how  to  make  sure  that  carloadings  lead  disposable 
income  and  not  conversely,  or  that  the  Federal  discount  rate  does  lead 
and  direct  the  money  supply  and  not  try  belatedly  to  repair  past 
mistakes.  I  am  sure  that,  ultimately,  one  has  to  fall  back  on  economic 
theory;  one  is  forced  to  specify  bits  and  pieces  of  any  autoregressive 
econometric  model,  because  no  amount  of  mechanical  screening  of  the 
time  series  themselves  can  answer  the  third  question  convincingly. 

In  30  years  of  research  the  National  Bureau  of  Economic  Research 
has  isolated  about  a  dozen  fairly  satisfactory  leading  indicators  out  of 
800-odd  time  series.1  I  think,  however,  that  in  nearly  all  cases, 
a  priori  considerations  would  have  led  to  the  selection  of  these  leading 
series  without  the  laborious  wholesale  analysis  of  hundreds  and 
hundreds  of  time  series.  For  instance,  Average  Hours  Worked  in 
Manufacturing  is  a  good  candidate  for  leading  indicator  of  manu- 
facturing activity  because  we  know  from  independent  observation 
that  it  is  easier  for  a  business  establishment  to  take  care  of  a  moderate 
increase  in  orders  by  overtime  than  by  hiring  new  workers  and  easier  to 
tide  over  a  lull  by  putting  its  workers  on  short  time  than  by  laying  some 
off  at  the  risk  of  losing  them.  All  the  sensible  leading  indicators 
thrown  up  in  the  National  Bureau's  screening  are  obvious  in  a  similar 
way.  An  oddity  like  the  production  of  animal  tallow,  which  is  said 
to  lead  better  than  many  other  series,  could  not  have  been  discovered 
by  a  priori  reasoning,  but  neither  is  it  used  by  any  sane  forecaster,  for  a 
good  empirical  fit  is  no  substitute  for  a  sound  reason. 

Part  of  the  findings  of  the  National  Bureau  are,  I  think,  tautological, 
because  the  timing  indicators  lead,  lag,  and  coincide  not  with  §ach 
other  individually,  but  with  the  reference  cycle,  which  is  an  index  of 

1  See  Geoffrey  H.  Moore,  Statistical  Indicators  of  Cyclical  Revivals  and  Hefem'ons 
(Occasional  Paper  31,  New  York:  National  Bureau  of  Economic  Research,  1950), 
particularly  chap.  7  and  appendix  B. 
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"general  business  activity."  The  latter  is  a  vague  conglomerate  of 
employment,  production,  price  behavior,  and  monetary  and  stock 
market  activity;  therefore,  it  is  no  wonder  at  all  that  some  series  lead, 
some  coincide  with,  and  others  lag  behind  it.  The  reference  cycle  is  a 
useful  summary,  but  we  should  not  be  misled  into  existential  fallacies 
about  it. 

12.14.  The  diffusion  index 

A  diffusion  index  is  a  number  stating  how  many  out  of  a  given  set  of 
time  series  are  expanding  from  month  to  month  (or  any  other  interval). 
Diffusion  indexes  can  be  constructed  from  any  set  of  series  whatsoever 
and  according  to  a  variety  of  formulas,  of  which  I  shall  discuss  just 
three. 

There  are  two  reasons  why  one  might  want  to  construct  a  diffusion 
index.  One  is  the  belief  that  a  business  cycle  starts  in  some  corner  of 
the  economic  system  and  propagates  itself  on  the  surrounding  territory 
like  a  forest  flro.  This  says  in  effect  that  the  diffusion  index  is  a  cheap 
short-cut  ftutorcgrcssive  econometric  model.  The  second  reason  is 
that  the  particular  formula  used  to  construct  the  index  captures  in  a 
handy  way  the  logic  of  economic  behavior. 

Three  different  formulas  have  been  suggested  for  the  diffusion  index: 

Formula  A        Per  cent  of  the  series  expanding 
Formula  B        Per  cent  of  the  series  reaching  turns 
Formula  C         Average   number    of    months   the   series   have   been 
expanding 

Research  by  exhaustion  argues  that  we  ought  to  try  all  these 
formulas  on  all  time  series  and  choose  the  formula  that  gives  prag- 
matically the  best  results.  This  can  be  done  quite  cheaply  on  the 
Univac.  I  think  such  a  procedure  will  frustrate  our  search  for  good 
indicators,  because  each  formula  embodies  a  different  theory  of 
economic  behavior,  not  universally  suitable. 

Formula  A  is  justified  by  the  classical  type  of  business  cycle,  where 
income,  employment,  prices,  hours,  inventories,  production,  and  so  on, 
and  their  components  move  up  and  down  in  rough  agreement  or  with 
characteristic  lags.  Suppose,  however,  that  the  authorities  control 
totals — employment,   some   price  index,   credit,   or  the  balance  of 
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payments.  The  result  is  " rolling  readjustment"  rather  than  cycles. 
Formula  A  has  lost  its  relevance.  In  a  world  of  rolling  readjustment 
this  formula  will  show  an  uneventful  record  and  will  not  be  able  to 
indicate,  much  less  predict,  sectional  crises  hiding  under  a  calm  total. 

Formula  B  is  justified  if  consumers  and  business  are  more  sensitive  to 
turns,  however  mild,  than  to  accelerations,  however  violent.  Invest- 
ment  plans  are  likely  to  be  of  this  kind.  As  long  as  there  is  expansion* 
in  demand,  any  overexpansion  will  be  made  good  eventually.  If  there 
is  contraction,  however  small,  the  mistake  is  more  obvious,  and  panic 
may  easily  result.  On  the  other  hand,  there  are  many  areas  in  both 
the  consumer  and  the  business  sectors  where  small  turns  are  not  taken 
seriously.  Formula  B,  therefore,  can  be  used  to  best  advantage  in 
studying  certain  investment  series  (like  railways'  orders  of  rolling 
stock)  but  is  counterprescribed  elsewhere. 

Formula  C  gives  great  emphasis  to  reversals.  Take  a  component 
that  has  been  slowly  expanding  for  some  months,  then  turns  down 
briefly.  Formula  C  registers  (for  this  component)  I?  2,  3,  etc,  up  to  a 
larjjjc  positive  number,  then  —1  (for  the  first  month  of  contraction), 
The  more  sustained  the  expansion,  the  more  violently  does  the  formula 
register  a  halt  or  small  reversal.  This  formula,  then^  Is  appropriate 
where  habit  and  momentum  play  an  important  part.  Where  could 
we  possibly  want  to  apply  it? 

Hire-purchase  may  be  related  to  disposable  income  Sn  some  way 
that  agrees  with  the  logic  of  formula  C.  Suppose  that  small  increases 
in  income  go  into  down  payments  and  time  payments  iof  morci  and 
more  gadgets;  if  so,  a  small  fall  in  income  would  put  a  complete  stop 
to  new  hire-purchase,  because  the  family  would  continue  tti  oontraottial 
time  payments  on  the  old  gadgets  and  would  not  be  llkc'o  to  mit 
into  food,  clothing,  and  recreation  to  buy  new  gadgets,  This  Is  a 
theory  of  consumer  behavior,  and  formula  C  is  a  convenient  way  to 
express  it  short  of  an  econometric  equation. 

Exercises 

12.E  Construct  diffusion  indexes  by  each  formula  from  the  two 
time  series  below: 


Series  1 

100, 

96, 

90, 

96, 

97, 

95, 

97 

Series  2 

100, 

99, 

102, 

100, 

100, 

101, 

10? 
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and  compare  the  cyclical  behavior  of  the  indexes  with  that  of  the 
sum  of  the  two  series. 

12.F  In  Exercise  12.E,  series  2,  replace  the  99  by  101,  and  con- 
struct formula  A.  Must  turning  points  in  the  sum  be  preceded  by 
turning  points  in  the  index? 

12. G  Construct  an  example  to  show  that  an  index  according  to 
formula  C  can  be  completely  insensitive  to  the  sum  of  the  component 
series. 

12.11  Show  by  example  the  converse  of  Exercise  12.G,  namely, 
that  swings  in  the  d illusion  index  formula  C  need  not  herald  turns 
(or  any  change  whatsoever)  in  the  sum  of  the  component  series. 


12.15.  Abuse  of  long-term  series 

One  unfortunate  by-product  of  time  series  analysis  is  that  it  requires 
longtime  series  with  which  to  work,  and  several  research  organizations 
have  responded  enthusiastically  to  the  challenge. 

For  example,  I  have  heard  urgings  that  we  construct  Canadian 
historical  statistics  for  the  purpose  of  sorting  out  timing  indicators,  on 
the  ground  that  what  took  the  National  Bureau  30  years  can  now  be 
done  in  30  hours  electronically.  I  think  this  kind  of  work  quite  futile, 
for  a  few  moments'  reflection  will  convince  us  of  its  negative  results. 
The  Canadian  economy,  compared  with  the  American,  is  small  and 
relatively  unbalanced;  therefore,  Canadian  historical  statistics  will 
have  a  very  large  irregular  component,  which  will  overwhelm  the  lino 
structural  relationships  we  want  to  uncover.  The  Canadian  economy, 
being  open,  responds  to  impulses  from  abroad;  therefore,  even  if  we 
had  good  domestic  historical  time  series,  our  chances  of  finding  among 
them  good  indicators  are  slim.  We  also  know  that  the  Canadian 
economy  is  "administered"  (it  has  more  governments  per  capita  than 
we  have  and  moro  industrial  concentration);  so  the  developments  that 
are  foreshadowed  by  the  indicators  are  likely  to  be  anticipated  by  the 
big  policy  makers,  with  the  result  that  predictions  go  foul.  We  know 
that  Canada  is  and  will  be  growing  fast  and  that  the  past  (on  which  all 
indicators  rely)  will  not  be  a  dependable  guide* 

My  guess  is  that  the  earliest  useful  year  for  time  series  on  bread 
baking  is  somewhere  around  1920.  For  iron  ore  shipments  it  is  1947, 
the  year  when  certain  Great  Lakes  canals  were  deepened.     However, 
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for  housing  demand  as  a  function  of  family  formation,  many  decades 
or  even  centuries  might  prove  to  contain  valid  information. 

There  are  many  good  reasons  why  we  might  want  to  construct 
uniformly  long  historical  statistics,  but  certainly  the  needs  of  cycle 
forecasting  is  not  one  of  them. 

12.16.  Abuse  of  coverage 

An  unfortunate  by-product  of  diffusion  index  analysis  is  that  it 
encourages  the  construction  of  complete  sets  of  data  when  incomplete 
ones  would  be  more  satisfactory.  This  is  so  because  the  timing  and 
irregular  features  of  the  diffusion  index  change  with  the  number  of 
scries  included  in  it. 

Let  us  suppose  that  we  want  to  forecast  industrial  production  by 
means  of  average  weekly  hours  worked;  the  series  rationalised  in 
Sec.  12. 13  is  a  possible  leading  indicator.  If  hours  worked  come  broken 
down  by  industry,  we  suspect  we  might  do  better  if  we  use  a  diffusion 
index  of  the  basic  series  rather  than  the  over-all  average. 

Now  our  first  impulse  is  to  look  at  the  published  series  for  Hours 
Worked  and  make  sure  that  they  give  complete  coverage  by  industry 
and  by  locality  and  that  the  series  have  no  gaps  in  time.  After  all, 
we  want  to  forecast  for  all  industry  and  for  the  entire  country.  Yet  it 
is  unreasonable  to  desire  full  coverage. 

First,  some  industries  employ  labor  as  a  fixed,  not  a  variable,  input. 
A  generating  station,  if  it  is  operated  at  all,  is  tended  by  a  switchman 
24  hours  a  day,  regardless  of  its  output.  Labor  is  uncorrelated  with 
output.  Here  is  a  case  where  coverage  does  harm  to  our  forecast, 
because  it  introduces  two  uncorrelated  variables  on  each  side  of  the 
scatter  diagram,  so  to  speak. 

Second,  in  the  service  industries,  the  physical  measure  of  output  w 
labor  input,  because  this  is  how  the  compilers  of  government  Statistics 
measure  the  production  of  services.  If  we  insist  on  coverage  of  the 
services,  we  get  trivial  correlations,  not  good  forecasts. 

Third,  during  retooling  it  is  possible  to  have  long  working  hours  and 
no  industrial  production.  What  should  we  do?  Throw  out  entirely 
any  industry  that  has  retooling  periods?  Not  at  all.  It  is  enough  to 
suppress  temporarily  from  consideration  the  data  for  this  industry 
until  the  experts  tell  us  that  retooling  and  catching  up  on  backlog  are 
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over.  A  deliberate  time  gap  in  the  statistics  improves  them.  This 
method,  though  it  appears  to  be  wasting  information,  actually  uses 
more,  for  it  includes  the  fact  that  there  has  been  a  retooling  period. 
The  fact  that  the  diffusion  index  is  dehomogcnized  is  a  flaw  of  a  second 
order  of  importance. 

In  this  way  we  select  statistics  of  average  hours  worked  to  use  in 
forecasting  industrial  production  which  are  a  statistician's  nightmare: 
they  have  time  gaps,  they  are  unrepresentative,  and  they  do  not 
reconcile  with  national  accounts  Labor  Income  when  they  are  multi- 
plied by  an  average  of  wage  rates. 

Similarly,  for  statistics  that  are  most  useful  in  forecasting,  it  is 
not  necessary  that  they  be  classifiable  into  grand  schemes,  such  as 
the  National  Income,  Moneyflows,  or  Input-Output  Tables.  The 
Canadians  plan  to  start  compiling  data  on  Lines  of  Credit  agreed 
upon  by  chartered  banks  and  their  customers  but  not  yet  credited  to 
the  customer's  account.  Such  a  series,  I  think,  will  prove  a  better 
predictor  than  the  present  one,  Business  Loans.  Now,  if  we  had 
information  on  Lines  of  Credit,  it  would  not  fit  any  existing  global 
scheme  and  would  not  become  any  more  useful  if  it  did.  For  forecast- 
ing purposes,  I  see  no  excuse  for  creating  a  matrix  of  Inter-sector 
Contingent  Liabilities  or  for  constructing  a  Balance  of  Withdrawable 
Promises  account. 

12.17.  Disagreements  between  cross-section  and  time 
series  estimates 

It  is  very  puzzling  to  find  that  careful  studies  of  the  consumption 
function  derived  from  time  series  give  a  significantly  larger  value  for 
the  marginal  propensity  to  consume  than  equally  competent  studies  of 
cross-section  data.  Three  kinds  of  explanations  are  available:  (1) 
algebraic  and  (2)  statistical  properties  of  the  model  explain  the  dis- 
crepancy; and  (3)  cross-section  data  and  time  series  data  measure 
different  kinds  of  behavior.  We  shall  concentrate  on  explanations 
1  and  2  in  order  to  show  that  algebra  and  statistics  alone  account  for 
much  of  the  difference  and  that  to  this  extent  explanations  of  the  third 
category  are  redundant. 

Cross-section  data  are  figures  of  income,  consumption,  etc.,  by 
individual  families  in  a  given  fixed  time  period.     Time  series  are  data 
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about  a  given  family's  consumption  and  income  through  time  or  about 
national  consumption  and  income  through  time. 

Algebraic  differences 

The  shape  of  the  consumption  function  can  breed  differences.    If  th© 
family  consumption  function  is  nonlinear,  say, 

C  =  a  +  0y  +  yy2  +  U  (12-18) 

then  the  consumption  function  connecting  average  income  av  y  and 
average  consumption  av  c  or  total  income  Y  and  total  consumption  C 
will  look  different  from  equation  (12-18),  even  if  all  families  have  the 
same  consumption  function  and  if  the  distribution  of  income  remains 
constant.     To  see  this,  take  just  two  families, 

ci  =  a  +  ft/i  +  y(yi)2  +  Ui 
d  =  a  +  ft/a  +  7(2/2) 2  +  u2 

add  together  and  divide  by  2  to  get 

avc  =  a  +  /?av2/  +  27(av  y)2  —  72/11/2  +  avu       (12-19) 

and,  in  general,  with  N  individuals, 

avc  =  a  +  j3av2/  +  Nyfav  y)2  —  7  Y  Mi  +  av  «     (12-20) 

One  might  argue  that,  when  income  distribution  remains  unchanged, 
the  cross  term  Zyflj  remains  constant  and  is  absorbed  into  the  estimate 
of  a.  But  this  is  false,  because  the  cross  term  appears  in  (12-20) 
multiplied  by  7,  another  unknown  parameter,  whose  estimate  is  bound 
up  with  the  estimates  of  a  and  p  in  the  least  squares  (or  other)  formulas. 
The  discrepancy  between  (12-20)  and  (12-19)  affects  the  estimates  of 
all  three  parameters  a,  ft  and  7  if  no  allowance  is  made  for  the  extra 
terms  of  the  average  consumption  function.  The  last  two  terms  of 
(12-20)  are  equal  to 

»  —  i 

that  is  to  say,  the  raw  moment  m[u  of  the  family  incomes.  It  follows 
that,  for  time  series  and  cross-section  studies  to  give  agreeing  results, 
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the  average  (or  the  total)  consumption  function  must  contain  a  term 
m/y  expressing  inequality  of  income,  even  if  this  inequality  should 
remain  unchanged  from  year  to  year.  Moreover,  neither  the  sample 
variance  of  av  y  nor  the  Pareto  index  is  suitable  for  the  correction  in 
question. 

If  income  distribution  varies  with  time,  to  get  the  two  approaches 
to  agree  our  correction  must  be  more  elaborate,  because  the  factor 


i 
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must  be  calculated  anew  for  each  year  of  the  data.  If  we  have  no 
complete  census  of  all  families,  a  sample  estimate  of  2  ?/,•?/,•  will  be  better 
than  nothing.  If  a  census  of  families  exists  but  we  are  in  a  hurry, 
again  we  can  approximate  Zysjj  to  any  desired  degree  by  taking  the 
families  in  large  income  strata. 

Statistical  differences 

Let  us  assume  that  the  consumption  function  of  a  family  is  linear 
and  constant  over  time  and  that  it  involves  another  variable  x, 
reflecting  some  circumstance  of  the  family,  like  age. 

c  =  a  +  fry  +  yx  +  u  (12-21) 

However,  let  the  characteristic  z,  as  time  passes,  have  a  constant 
distribution  among  the  several  families.  For  example,  in  a  stationary 
population,  the  ages  of  the  totality  of  families  remains  unchanged, 
although  the  age  of  any  given  family  always  increases.  If  we  aggregate 
(12-21)  we  get 

avc  =  a-f-/3av?/-f7ava;-r-avw  (12-22) 

but  x,  being  a  constant,  is  absorbed  into  the  constant  term  when  we 
estimate  (12-22).  Not  so  if  we  trace  the  history  of  one  such  family  by 
estimating  (12-21)  from  time  series. 

In  practice  there  is  a  further  complication:  the  characteristic  x  is 
not  independent  of  the  family's  income;  thus,  /§  and  $  are  shaky  esti- 
mates in  (12-21)  because  of  multicollinearity.  This  is  an  additional 
reason  why  time  series  and  cross  sections  disagree. 

Thus,  we  do  not  need  to  go  so  far  afield  as  to  postulate  several  kinds 
of  consumption  functions  (long-term,  short-term)  to  explain  these 
discrepancies.     If,  after  we  have  corrected  for  the  algebraic  and 
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statistical  sources  of  discrepancy,  some  further  disagreement  remains 
unexplained,  that  is  the  time  for  additional  theories. 


Further  readings 

Kendall,  vol.  2,  devotes  two  lucid  chapters  to  the  algebra  and  statistics 

of  univariate  time  series. 

The  proof  that  ignoring  the  serial  correlation  of  the  random  term  in  a  single 
equation  leaves  least  squares  estimates  unbiased  and  consistent  can  be  found 
in  F.  N.  David  and  J.  Neyman,  "Extension  of  the  Markoff  Theorem  on  Least 
Squares"  (Statistical  Research  Memoirs,  vol.  2,  pp.  105-116,  December,  1938). 

How  to  treat  serial  correlation  is  discussed  by  D.  Cochrane  and  G.  H. 
Orcutt,  "Application  of  Least  Square  Regression  to  Relationships  Containing 
Auto-correlated  Error  Terms"  (Journal  of  the  American  Statistical  Association, 
vol.  44,  no.  245,  pp.  32-61,  March,  1949). 

Eugen  Slutsky,  "The  Summation  of  Random  Causes  as  the  Source  of 
Cyclical  Processes"  (Econometrica,  vol.  5,  no.  2,  pp.  105-146,  April,  1957), 
is  rightly  famous  for  its  contribution  to  theory  and  its  interesting  experi- 
mental examples  with  random  series  drawn  from  a  Soviet  government  lottery. 

Correlogram  and  periodogram  shapes  are  discussed  in  Kendall,  vol.  2, 
chap.  30. 

The  brief  discussion  of  autocorrelation,  with  examples,  in  Beach,  pp.  17&- 
180,  is  simple  and  useful. 

The  early  article  by  Edwin  B.  Wilson,  "The  Periodogram  of  American 
Business  Activity"  (Quarterly  Journal  of  Economics,  vol.  48,  no.  3,  pp.  375-417, 
May,  1934),  is  both  ambitious  and  sophisticated. 

Tjalling  C.  Koopmans,  in  his  review,  entitled  "Measurement  without 
Theory, "  of  Arthur  F.  Burns  and  Wesley  C.  Mitchell's  Measuring  Business 
Cycles  (Review  of  Economic  Statistics,  vol.  29,  no.  3,  pp.  161-172,  August,  1947), 
delivers  a  classic  and  definitive  criticism  of  some  investign  tors'  avoidance  of 
explicit  assumptions.  All  would-be  chartists  should  read  it.  Koopmans  also 
gives,  on  p.  163,  a  summary  account  of  the  National  Bureau  method  for 
isolating  cycles. 

J.  Wise,  in  "Regression  Analysis  of  Relationships  between  Autocorrelated 
Time  Series"  (Journal  of  the  Royal  Statistical  Society,  ser.  B,  vol.  18,  no.  2, 
pp.  240-256,  1956),  shows  that,  in  recursive  systems  of  two  or  more  equations, 
least  squares  is  biased  both  when  the  random  terms  of  the  separate  equations 
are  interdependent  and  when  the  random  term  of  either  equation  is  serially 
correlated. 

The  reference  of  Sec.  12.12  is  G.  H.  Orcutt,  "A  Study  of  the  Autoregressive 
Nature  of  the  Time  Series  Used  for  Tinbergen's  Model  of  the  Economic 
System  of  the  United  States  1919-1932,"  with  discussion  (Journal  of  the 
Royal  Statistical  Society,  ser.  B,  vol.  10,  no.  1,  pp.  1-53,  1948).  Arthur  J. 
Gartaganis,   "Autoregression  in  the  United  States  Economy,   1870-1929" 
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(Economctrica,  vol.  22,  no.  2,  pp.  228-243,  April,  1954),  uses  much  longer  time 
series  and  concludes  that  the  over-all  autorcgressive  structure  changed 
drastically  around  the  year  1913.    Gartaganis  uses  six  lags. 

I  have  discussed  the  mathematical  properties  of  the  diffusion  index  in 
"Must  the  Diffusion  Index  Lead?"  (American  Statistician,  vol.  11,  no.  4, 
pp.  12-17,  October,  1957).     Geoffrey  Moore's  comments  are  on  pp.  16-17. 

Trygve  Haavclmo,  "Family  Expenditures  and  the  Marginal  Propensity 
to  Consume"  (Economctrica,  vol.  15,  no.  4,  pp.  335-341,  October,  1947), 
reprinted  as  Cowles  Commission  Paper  26,  affords  a  good  exercise  in  the 
decoding  of  compact  econometric  argument.  Haavelmo  deals  with  the 
discrepancies  arising  from  different  ways  of  measuring  the  consumption 
function. 


APPENDIX  A 


Layout  of  computations 


I  recommend  a  standard  layout,  no  matter  how  large  or  small  the 
model  or  what  estimating  procedure  one  plans  to  use  (least  squares, 
maximum  likelihood,  limited  information)  or  what  simplifying  assump- 
tions one  has  made.    There  are  three  general  rules  to  follow:] 

1.  Scale  to  avoid  large  rounding  errors  and  to  detect  other  errors  more 
easily.    Scaling  should  be  applied  in  two  stages. 

a.  Scale  the  variables. 

b.  Scale  the  moments. 

2.  Use  check  sums. 

3.  Compute  all  the  basic  moments.    This  may  seem  redundant,  but 
is  actually  very  efficient  if  one  wants 

a.  To  compute  correlations. 

b.  To  experiment  with  alternative  models. 

c.  To  get  least  squares  first  approximations. 

d.  To  select  the  best  instrumental  variables. 
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The  rules  in  detail 

Stage  1 

Scale  the  variables.  Express  all  of  them  in  units  of  measurement 
(say,  cents,  tens  of  dollars,  thousands,  billions,  etc.)  that  reduce  all  the 
variables  to  comparable  magnitudes.  Scale  the  units  so  as  to  bring 
the  variables  (or  most  of  them)  into  the  range  from  0  to  1.  For 
instance: 

National  income  x\  =  0.475  trillion  dollars 

Hourly  wage  rate  x2  =  0.182  tens  of  dollars 

Population  x3  =  0.165  billions 

Price  of  platinum  Xa  ■■  0.945  hundreds  of  dollars  per  ounce 

This,  rathei  tfian  the  range  1  to  10  or  10  to  100,  is  preferred,  because 
we  shall  include  an  auxiliary  variable  identically  equal  to  1.  Then  all 
variables,  regular  and  auxiliary,  are  of  the  same  order  of  magnitude. 

Stage  2 

Arrange  the  raw  observations  as  in  Table  A.l.  Note  that  the 
endogenous  variables,  the  y's,  are  followed  by  their  check  sum  Y  and 
that,  in  addition  to  all  the  exogenous  variables  t\%  z2y  .  .  .  ,  z#_i,  we 
devote  a  column  to  the  constant  number  1,  which  is  defined  as  the  last 
exogenous  variable  2//.  These  are  then  followed  by  the  check  sum  Z 
of  the  exogenous  variables  including  zh  —  1  and  by  a  grand  sum 
X  -  Y  +  Z. 

Stage  3 

The  raw  moment  of  variable  p  on  variable  q  is  defined  as 

where  the  sum  is  over  the  sample.  A  raw  moment  is  not  the  same 
thing  as  the  simple  moment  mpq  defined  in  the  Digression  of  Sec.  1.2. 
The  simple  moment  mpq  is  also  called  the  (augmented)  moment  from  the 
mean  of  variable  p  on  variable  q. 

Compute  the  raw  moments  of  all  variables  on  all  variables.  This 
gives  the  symmetrical  matrix  m'  of  moments,  shown  in  Table  A.2. 
In  Table  A.2  the  symbol  m'  is  omitted,  and  only  the  subscripts  appear; 
for  instance,  yoyi  stands  for  m'vgUtm 
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Table  A.l 
Arrangement  of  raw  observations 


Endogenous 

Check 

SUM 

Exogenous  variables 

Check 
sum 

Grand 

Time 

VARIABLES 

Regular 

Aux. 

SUM 

1 
2 
3 

8 

yid)  • 
Vi(2)   ■ 
yi(3)   • 

Vi(S)  • 

•  yo(D 

•  ya(2) 
'  fr(3) 

•  yo(8) 

YQ) 

Y(2) 
F(3) 

Y(S) 

zi(2)   • 
H(S)  • 

9l<8)    ' 

•  *if-i(2) 

•  fjf-ift) 

1 
1 
1 

1 

WO 

2(3) 

xm 

xm 
xm 

X(S) 

Table  A.2 


2/i2/i  '  ' 

•  yiya 

ViY 

yi«i    2/i«2  • 

•  •  yiZH-i 

y\  •  1 

ya 

mX 

■ 

. 

. 

. 

i 

. 

. 

. 

. 

* 

yayi  •  • 

•  ycs/o 

yoY 

yo2i    yo^a  • 

'    •   1/02/f-l 

ya- 1 

yoZ 

y0X 

YVx  •  • 

•  Yy0 

YY 

K*,    Yz2  • 

•  •  Yzh-i 

r-i 

rz 

YX 

Z\V\  -  - 

•  Zxya 

z\y 

Zl2l      2l«2    ' 

'    '    Z\ZH-\ 

fl-1 

zxZ 

SiX 

ziyi  '  ' 

-  z2ya 

z2Y 

Z2Z1      Z&t    • 

•    Z&H-X 

*,.l 

ZiZ 

HX 

• 

' 

• 

• 

• 

• 

• 

• 

• 

• 

Uff-iyi*  • 

'  ZH-iyo 

ZH-l 

ZH-\Z\      ZH-lZ*   ' 

•    '  ZH-lZH-l 

ZH-1'1 

2*-»Z 

ZH-lX 

— » 

1  •  V\  •  • 

•l-2/o 

i.y 

I'Zl            1  •  Z%   • 

•    1  '  2ff-l 

1-1 

\z 

IX 

Zyi  •  • 

•  Zy0 

zr 

Z«i                £«2  • 

•  Zzir-i 

z-1 

zz 

ZX 

Xy,  •  • 

•  Xya 

xr   x«,        x«8  • 

•    XZH-I 

XI 

XZ 

XX 
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Stage  4 

Compute  the  augmented  moments  from  the  mean  of  each  variable 
(except  zh  —  1)  on  each  variable,  e.g., 

mXlXi  «  SmXiXi  -  mXx.xm\.x% 

This  is  done  very  easily  because  mXi.x  and  mXx.x  are  always  on  the  level 
indicated  by  the  arrows  and  in  the  row  and  column  corresponding  to 

This  procedure  gives  a  square  symmetric  matrix  m  of  moments 
from  the  mean.  The  new  matrix  contains  one  row  and  one  column 
less  than  the  matrix  m\ 

Stage  5 

Rule  for  check  sums.  In  both  m  and  m'  any  entry  containing  a 
capital  Y  (or  Z)  is  equal  to  the  sum  of  all  entries  in  its  row  that  contain 
lower-case  y's  (or  z's).  Any  entry  containing  a  capital  X  is  equal  to 
the  sum  of  everything  that  precedes  it  in  the  row. 

All  these  things  are  true  in  the  vertical  direction,  since  the  matrices 
m'  and  m  are  symmetric. 

Stage  6 

Scale  the  moments.  This  step  is  not  always  possible.  Scan  the 
symmetric  matrix  in.  If  it  contains  any  row  (hence,  column)  of 
entries  all  (or  nearly  all)  of  which  are  very  large  or  very  small  relative 
to  the  rest  of  the  rows  and  columns,  divide  or  multiply  the  entire 
offending  row  and  column  by  an  appropriate  power  of  10.  The  purpose 
is  to  make  the  matrix  m  contain  entries  as  nearly  equal  as  possible. 
When  moments  have  comparable  magnitudes,  matrix  operations  on 
them  are  very  accurate,  rounding  errors  are  small,  and  calculating 
errors  can  be  readily  detected. 

Keep  accurate  track  of  the  variables  that  have  been  scaled  up  or 

down  in  stages  1  and  6  and  of  how  many  powers  of  10  in  each  stage 

and  altogether. 

. 
Stage  7 

Coefficients  of  correlation.  These  can  be  computed  very  easily 
from  m,  but  unfortunately  the  checks  do  not  work  in  this  case.    So 
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drop  the  check  sums  and  consider  only  part  of  m.  The  saiftpl© 
correlation  coefficient  between,  say,  the  variables  ya  and  zh  is 

"iviVfirhkn 

Coefficients  of  correlation  are  used  informally  to  screen  out  the  most 
promising  models  (see  Chap.  10  on  bunch  maps). 

Matrix  inversion 

This  is  a  frequent  operation  in  estimating  medium  and  large  systems. 
Details  for  computing 

M-1        and        MhW 

are  given  in  Klein,  pp.  15 Iff.  There  are  various  clever  devices  for 
inverting  a  matrix  and  performing  the  operation  M-1N.  Electronic 
computers  have  standard  programs,  and  it  is  well  to  use  them  if  they 
are  available.  If  M  is  small  in  size  and  if  both  M"1  and  M""*N  are 
wanted,  do  the  following:  Write  side  by  side  the  matrix  M,  the  unit 
matrix  of  the  same  size,  and  then  N. 

[M][I][N] 

Then  perform  linear  combinations  on  the  rows  of  the  entire  new 
matrix  [MIN]  in  such  a  way  as  to  reduce  M  to  a  unit  matrix.  When 
you  have  finished,  you  will  have  obtained 

[I][M~l][M-lN] 
For  example,  let 

n/r     T2  41     K    re  30  sol 


We  shall  trace  the  evolution  of  [M][I][N]  into  [IHM^HM^N], 

4 
6 


(RON). -[J    4 


1    0 
0    1 


6    30    50 
2      1      3 


Divide  the  first  row  by  2. 

(MIN)i-[[ 


2  i  y2  0 

6  I    0     1 


3    15    25 
2      1      3 
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In  this  new  matrix,  subtract  row  1  from  row  2 

j     M 


(MIN),  =  [J 
Divide  the  new  second  row  by  2. 

(min),*  [J  ; 


3        15 
-1     -14 


0  !        3 


Yi 


15 

-7 


25] 
-22j 


25 
-11 


Subtract  the  new  second  row  from  the  first. 


(MIN), 


A1  ° 

[0     2 


%     -X 


-k     y2 


3H 


22        36] 
-7     — 11 J 


Divide  the  new  second  row  by  2 
0 


(MIN), 


1     0 

Lo_j. 

[i] 


H    -K 


22 


36     ] 
/  ^-5^J 

[M-lN] 


[M-ll 

One  can  compute  a  string  of  "quotients"  M-1N,  M_1P,  M_1Q,  etc., 
by  tacking  on  N,  P,  Q  and  performing  the  linear  manipulations.  This 
technique  works  in  principle  for  all  sizes  of  matrices  with  more  than 
three  or  four  rows,  but  it  consumes  a  lot  of  paper  and  time. 
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Stepwise  least  squares 


Estimating  the  parameters  of 

V  =  7o  +  71*1  +  72*2  +  73*8  +   '  •  •  +  7//*//  +  U 

by  desk  calculator,  according  to  Cramer's  rule,  or  by  matrix  inversion 
is  a  formidable  task  when  //  is  greater  than  3.  The  stepwise  procedure 
about  to  be  explained  may  be  slow,  but  it  has  three  advantages  over 
the  other  methods: 

1.  It  can  be  stopped  validly  at  any  stage. 

2.  It  possesses  excellent  control  over  rounding  and  computational 
errors. 

3.  We  do  not  have  to  commit  ourselves,  ahead  of  time  and  once  and 
for  all,  on  how  many  decimal  places  to  carry  in  the  course  of  computa- 
tions, but  we  may  rather  carry  progressively  more  as  the  estimates  are 
successively  refined. 

I  shall  illustrate  the  method  by  the  simple  case 

w  =  ax  +  &y  +  7Z  +  u 

where  (in  violation  of  the  usual  conventions)  w  is  endogenous,  and 
x,  y,  z  are  exogenous.  For  the  sake  of  illustration  let  us  assume  that, 
in  the  sample  we  happen  to  have  drawn,  the  exogenous  variables  are 
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"slightly"  intercorrelated,  so  that  mxv,  mXBf  mxu,  mva,  m„u,  mau  are 
small  numbers,  although,  of  course,  mxx  —  mvv  =  mtt  —  1. 

Step  1 

On  the  basis  of  a  priori  information,  arrange  the  exogenous  variables 
from  the  most  significant  to  the  least  significant,  that  is  to  say,  accord- 
ing to  the  imagined  size  of  the  parameters  a,  0,  7,  disregarding  their 
signs: 

M  >  101  >  M 

Step  2 

To  estimate  a  first  approximation  to  a,  compute  &\  =  7nwx/mxx  as  an 
approximation  to  the  true  value.    Let  &\  *■  a  +  A%. 


StepS 

Form  a  new 

variable 

V  «*  W  —  OL\X 

and  estimate  a 

first  approximation  to  P  by  computing 

• 

*       mvy 
muy 

Step! 

Form  the  new  variable 

s  =  v  -  fry 

and  then  compute  the  first  approximation  to  7: 

■ 

m„ 

Step  5 

Form  the  new  variable 

Wl"«-  7i2 

and  compute 

*"/il    —    

mxx 

The  idea  here  is  to  estimate  the  error  A\  of  our  first  approximation 
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&i.    Compute 

at  =  5i  —  Ai 

as  a  second  approximation  to  a. 

Step  6 

Now  that  a  better  estimate  of  a  is  available,  there  is  no  point  in 
correcting  the  first  approximations  &,  71.  We  discard  them  and 
attempt  to  get  new  approximations  &,  72,  based  on  the  better  estimate 
a2.    We  first  use  a2  to  define  a  new  variable 

Vi  —  w  —  a2x 

Note  that  in  this  step  we  adjust  the  original  variable  w  (not  wx). 
Proceed  now  as  in  steps  3  to  5 : 

P2   ==   -—* 


m 


vv 


si  =*  vi  -  (32y 

72   =    ^ 

W2  =  Si  —  722 

S3   =    «2   —    A2 

and  so  on. 

The  method  of  stepwise  least  squares  does  indeed  yield  better  esti- 
mates in  each  round.     To  see  this,  consider  the  steps 

-    =  mwx  --  a  Jb  Wtfu+y+y)-* 
011       mtx  ~  mxx 

=  a  +  Ax 

v  =  w  —  <iix  =  ax  4-  /%  -f-  72  +  w  —  ax  —  Atf 

=  #2/  +  72  +  w  -  Aix 

*       mvy              niiyz+u-Axx)^ 
"l      ^T"  =  "  "+■ ™ 

Tftyy  TTlyy 

s  =  v  -  &*/  =  0*/  4-  72  4-  w  -  Atf  -  Py  -  Bxy 
=  72  4-  w  —  i4iz  —  Biy 

m2.  m„ 


=  74-C 
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t0i  «  u  —  fi«  «■  72  -f  u  —  Ais  —  Biy  —  7i*  —  Ciz 
=  u  —  Aire  —  #iy  —  Ciz 

mzx  mxx 

«i  -  fii  -  JTi  -  a  +  ki  -  At  +  B±igiHzgi±f 

I       *W<U-ll|y-C,I>.* 

=  a  -t- 


mxx 


=  a  + A 


The  residual  factor  A2  is  of  smaller  order  of  magnitude  than  Ai, 
and  so  d2  is  better  than  <5i  as  an  estimate  of  a.    To  see  this  consider 

mxx  ?7lxi  mxx 

Expressing  B\  and  Ci  in  terms  of  Ai, 

3  *  L  m«  m«       mw  m**  **  m«  mw  ^«  J 

[#&„*  ^*y  W«<  77lyg  771,2,  I 

mxx  mvv       raxx  m„  mvl/J 

,    |7nUI  ^_  mvx  mUi/  __  max  mUM   .   muv  mtx  myt  1 

L  7nxx       7nxx  rriyy  mxx  mM       wvv  mxx  m,tu  J 

Each  bracketed  term  is  of  small  order  of  magnitude.  So,  unless  y  is 
numerically  very  large  (which  was  guarded  against  in  step  1),  it 
follows  that  a2  is  an  improvement  over  &i.  The  same  can  be  shown 
for  J}*,  72,  and  <53,  compared  with  &,  71,  and  52,  respectively. 

The  method  of  stepwise  least  squares  can  also  be  used  when  x,  yf  and 
z  are  endogenous  variables.  In  this  case,  although  the  bracketed 
terms  are  not  negligible,  they  keep  decreasing  in  successive  rounds  of 
the  procedure.  Had  another  variable  been  treated  as  the  independent 
one,  the  stepwise  method  (like  any  procedure  based  on  naive  least 
squares)  would,  in  general,  have  given  another  set  of  results. 
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Subsample  variances 
as  estimators 


Consider  the  model  y  =  a  +  yz  +  u,  under  all  the  Simplifying 
Assumptions. 
Let 

be  the  maximum  likelihood  (and  also  the  least  squares)  estimate  of  7 
based  on  a  sample  of  size  S.     Its  variance  is 

«i,m  = ««  -  *f)*  = ««  -  yy  =  «feV 

|"(mi?i  +  •  •  •  +  ttsza)'"! 
-      [      M  H +  4)2      J 

Holding  z\,  ,  .  .  f  zs  fixed,  this  reduces  to 

(*!+•••  +  4)2  *!  +  •  •  •  +  4      m„ 

Let  us  now  ask  what  happens  on  the  average  if  from  the  original 
sample  we  obtain  its  S  subsamples  (each  of  size  S  —  1),  if  we  compute 
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the  corresponding  parameters 

and  if  we  then  compute  the  sample  variance  V  of  these  fs. 

SV  m  V  [f  («)  -  av  f  (•)]»  -  V  (B.)2  -  §  ( V  *.)' 

e(5F)  -  SuV  m  g  Y(£.)a  -  |  e  ( V  #.Y 

For  our  fixed  constellation  of  values  («i,  .  .  .  ,  zs)  of  the  exogenous 
variable,  any  terms  of  the  form  $(%%&&)  (t  3^  j)  equal  zero.  By 
careful  manipulation  we  obtain 

sev  -  (rttU  r  y  —  j—3  ~  ±  y  — x—2 

a  a 

£  Z/  (mta  -  z\)(m„  -  z^J 
»<> 

So  far  this  is  an  ezed  result.     If  we  make  the  further  assumption  that 

the  values  zi,  .  .  .  ,z5  of  the  exogenous  variable  are  spread  "typically," 

the  relations 

.      S-  1 
m„  —  z;  =  — •$ —  mlt 

m„  -  z?  -  zj  =      g     m„ 

are  approximately  true,  or  at  least  become  so  in  the  course  of  the 
summations  given  in  the  last  brackets.     Therefore, 

It  follows  that  (S  —  l)V  is  an  unbiased  estimator  of  cov  (^,^|S). 
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Proof  of  least  squares  bias 
in  models  of  decay 


Let  the  variable  5«  be  equal  to  1  if  time  period  t  is  in  the  sample,  and 
zero  otherwise.     The  least  squares  estimate  of  7  is 


*- 


y  hytyt-i 

j 

2  fctf-i 


The  proposition  e(^)  <  7  will  be  proved  by  mathematical  induction. 
It  will  be  shown  true  for  an  arbitrary  sample  of  S  =  2  points;  then  its 
truth  for  £  +  1  will  be  shown  to  follow  from  its  truth  for  any  & 

Definition.  A  conjugate  set  of  samples  contains  all  samples  having 
the  following  properties : 

1.  The  samples  include  the  same  time  periods  and  skip  the  same 
time  periods  (if  any) ;  let  h  be  the  first  period  included  and  ta  the  last. 

2.  If  time  period  t  is  included  in  the  samples,  then  all  samples  of  the 
conjugate  set  have  disturbances  ut  of  the  same  absolute  value.  The 
disturbances  do  not  have  to  be  constant  from  period  to  period. 

209 


210  APPENDIX  D 

3.  When  a  time  period  is  skipped  by  the  sample,  algebraically  equal 
disturbances  must  have  operated  on  the  model  during  the  skipped 
periods. 

4.  The  samples  have  come  from  a  universe  having,  as  of  tl}  the  same 
values  for  all  predetermined  variables. 

Consider  an  arbitrary  sample  of  two  points.  Let  one  point  come 
from  period  j  and  the  other  from  period  k  (k  >  j) ;  the  sample  can  be 
completely  described  by  the  disturbances  operating  at  and  between 
these  two  time  periods;  that  is, 

Si  =  (+ty,ty+i,  •  •  •  ,w*_i,+WjO 

Si  has  three  conjugates: 

S2  =  (+tt/,%+i,  .  .  .  ,w*_i,-w*) 
Sa  =  (— Uj,Uj+ii  .  .  .  ,uk-i,+uk) 
S4  =  (— Uj'M+u  .  .  .  ,Wfc_i,-w*) 

Denote  the  four  corresponding  least  squares  estimates  of  y  by  the 
symbols  t(++),  7(+~),  7(-+),  and  ?(--). 

By  definition,  each  of  the  four  conjugate  samples  has  inherited  from 
the  past  the  same  value  ?//_i  of  lagged  consumption.  In  period  j  the 
random  disturbance  Uj  operates  positively  for  samples  Si  and  S2  and 
negatively  for  S3  and  S4.  Therefore,  in  the  next  period  Si  and  S2 
inherit  one  value  v  =  yyj-i  +  u,  for  lagged  consumption,  and  samples 
S3  and  S«  inherit  another  value  n  =  72//-.  1  —  Uj.  By  the  definition  of 
conjugates,  in  periods  j  +  1,  j  +  2,  .  .  .  ,  k  —  1,  equal  random 
disturbances  affect  all  samples  Si  to  S4.  Moreover,  in  period  /c, 
samples  Si  and  S2  receive  an  equal  inheritance  of  lagged  consumption 
from  the  past.  Call  it  yp.  Its  exact  value  can  be  obtained  by  applying 
model  (3-2)  (see  Sec.  3.2)  to  p  successively  enough  times,  but  this  value 
is  of  no  interest.  Samples  S3  and  S4  each  get  the  inheritance  yn  which 
likewise  arises  from  the  application  of  model  (3-2)  to  n.  The  two 
inheritances  are  different:  yp  >  ynf  since  p  >  n. 

When  we  come  to  period  &,  samples  Si  and  S2  part  company,  because 
the  first  receives  a  boost  +uk)  and  the  second  receives  the  opposite 
—  Uk.     For  the  same  reason  S3  parts  company  with  S4. 

Define 

q  =  TJP  +  uk        r  =  yyp  -  uk        v  =  yyH  -f-  uk        w  =  yyn  -  uh 
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The  four  conjugate  estimates  are 

•V        1/    "        ,i2     ,      i     7i2  '\  /    "         ,.2  i     ,,2 

i/y-i  T*  2/n  i/;-l  T  £'n 

Symbolize  the  sum  of  these  four  estimates  by  Y  ^  (±  ±)  or    Y    f. 

ceaj  Si 

v«/4.  4.\  -  2pyy-i  +  (g  +  r)yp       2nyy.i  +  (g  4-  io)y, 
\  */,2-i  +  2/p    T    nJLt  +:.yS  / 

\        vU  +  vl      vU  +  vlJ 

=  Ay  -f  residual 

The  residual  is  always  negative,  because  y\  >  y\  if  y>_i  >  0,  and 
2/?  <  yl  if  2/;-i  <  0-  Therefore  the  average  ^  estimate  from  this 
conjugate  set  of  samples  is  less  than  the  true  7. 

Consider  an  arbitrary  sample  of  size  S  +  1,  which  I  call  sample 
B(-f).  Let  it  contain  observations  from  time  periods  ji,  j2,  .  .  .  tjs, 
js+i  (which  need  not  be  consecutive).  B(+)  can  be  completely 
described  by  the  disturbances  that  generated  it  plus  the  predetermined 
condition  ?/;v-i. 

B(+)  -  (Vh-il*tMu  •  •  •  tUfcWiJ 

Now  consider  another  sample  A  which  contains  one  time  period 
(the  last  one)  less  than  B(-f-)  but  which  is  in  all  other  respects  the 
same  as  B(-h): 

A  -  (sfo-ij  uhtuJv  .  .  .  ,uJs) 

The  conjugate  set  of  A  can  be  described  briefly  by 

conj  A  «  (yyri;  ±ujlf±ujv  .  .  .,±uJa) 

The  conjugate  set  of  B(+)  has  twice  as  many  samples  as  conj  A;  the 
elements  of  conj  B(+)  can  be  constructed  from  elements  of  conj  A  by 
adding  another  period  in  which  the  disturbance  wya+,  shows  up  once 
with  a  plus  sign  and  once  with  a  minus  sign. 
Define  B(— )  as  the  sample  consistent  with  predetermined  condition 
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yix-i  and  containing  all  the  disturbances  of  sample  B(-f )  identically, 
except  the  last,  which  takes  the  opposite  sign.     Therefore,  if 

B(  +  )   =   (fr,-i;tCft,1ffe    •    •    •    ,Wfc,+Wy*J 

then  'B(-)  =  (y^-i)  uh,ujv  .  .  .  ,uJa, - uja+l) 

Assume  that  the  estimates  ^  derived  from  conj  A  average  less  than 
the  true  7  (0  <  7  <  1).     Symbolize  this  statement  as  follows: 

2t(±  ±   •  '  *   ±)  <2S7 
Each  ^  in  the  above  sum  is  a  fraction  of  the  form 

Let  ^(A)  stand  for  the  estimate  derived  from  sample  A.  Then  ^(A) 
can  be  expressed  as  a  quotient  N/D  of  two  specific  sums  N  and  D, 
where  D  is  positive.  Each  sample  from  the  set  conj  B  gives  rise  to  an 
estimate  of  7.  The  formulas  now  are  fractions  like  (1),  but  the  sums 
have  one  more  term  in  the  numerator  and  one  more  in  the  denominator, 
because  one  more  period  is  involved.     Thus,  Writing  yf  for  yJa) 

N  +  y'W  +  tffrj       N  +  yyf*  -  y'uja_x 


*[B(+)] 

Similarly, 

It  follows  that 


D  +  y"*  D  +  7/'2 


w><->]  -  N+tl7> 


?[B(+)]  +  ?[B(-)]  =  2^^ 

If  -y(A)  >  7,  then  the  last  fraction  is  less  than  7(A);  if  7(A)  —  7, 
then  the  fraction  equals  7;  if  7(A)  <  7,  then  the  fraction  is  less  than  7. 
Exactly  the  same  is  true  for  all  samples  in  the  conjugate  set  of  A. 
Therefore, 

2  >  <  2  2  v  <  2S+ly 

conj  B  conj  A 

vrhich  completes  the  proof. 

I  shall  not  discuss  what  happens  if  the  value  of  7  does  not  lie  between 
0  and  1,  because  no  new  principle  or  new  difficulty  arises.  As  an 
exercise,  find  the  bias  of  7  if  7  lies  between  0  and  —  1. 
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Completeness  and  stochastic 
independence 

/ 


The  proof  that  s(ug,up)  =  0  implies  det  B  5^  0  is  by  contradiction. 
If  det  B  =  0,  then  a  nontrivial  linear  combination  of  B's  rows  is  0  (the 
zero  vector) : 

L  =  Xi?i  +•■■••.  +  \g$q  -  0  (1) 

Hence,  Xigiy2  +  •  •  •  +  \o$0y2  =  0 

But,  by  the  model  By  +  Tz  =  u,  we  also  have 

So  XiWi  +•••.  +  \QuQ  =  (Xiyi  +.•••+  XoYg)z  =  Z  (2) 

Since  Z  is  a  constant  number  for  any  constellation  of  values  of  the 
exogenous  variables,  we  have  in  equation  (2)  a  nontrivial  linear  relation 
among  the  disturbances  U\,  .  .  .  ,  Uq.    This  contradicts  the  premise 
that  they  are  independent. 
This  argument  shows  that  £(u0)up)  —  0  if  and  only  if  det  B^O. 
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The  asterisk  notation 


A  single  star  (*)  means  presence  in  a  given  equation  of  the  variable 
starred;  a  double  star  (**)  means  absence. 
Accordingly,  in  the  model 

Vl  +  7ll*l  +  712*2  +  713*3  +  714*4   =   Ml  (1) 

0212/1  +  2/2  +  023?/3  +  721*1  +  Y22*2  H"  723*3  =   ^2  (2) 

0312/1  +  2/3  +  731*1  +  732*2  =  uz  (3) 

with  reference  to  the  third  equation, 

y*  means  the  vector  of  the  endogenous  variables  present  in  the 
third  equation,  namely,  vec  (yi,yz) 
y**  =  vec  (#2) 
z*  =  vec  (*i,*2) 
z**  =  vec  (*3,*4) 

For  the  first  equation, 

y*  ■  vec  (?/i) 
y**  =  vec  (2/2,2/3) 

Z*  =   VeC  (*i,*2,*3,*4) 


Stars  (single  or  double)  may  also  be  placed  on  the  symbol  x,  which 
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stands  for  all  variables,  endogenous  or  exogenous.    Thus  for  the 

second  equation, 

x*  -  vec  (2/1,2/2,2/3,2:1,32,**) 
x**  =  vec  (z*) 

G*  is  the  number  of  if  a  present  in  the  ^th  equation ;  G**  is  the  number 
absent.  H*  is  the  number  of  z's  present;  H**  is  the  number  absent. 
Examples: 

G*  m  1    H*x  =  4    Gt*  =  0    //?  -  3    m*  m  1    H$*  m  2 

«*,  (#>  T?  are  vectors  made  up  of  the  nonzero  parameters  of  the 
^th  equation  in  their  natural  order,  a  here  serves  as  a  general  symbol, 
like  x,  for  all  parameters,  0  or  7.     Examples: 

a?  =  vec  (1,711,712,713,714) 
Bf  -  vec  (1) 
7?  =  vec  (711,712,713,714) 
7?  =  vec  (721,722,723) 
aj  =  vec  (031,1,731,732) 
yJ  =  vec  (731,732) 

In  Chap.  8, 1  place  stars  (or  pairs  of  stars)  not  on  vectors  but  on  the 
variables  themselves  to  emphasize  their  presence  in  (or  absence  from) 
an  equation.  For  instance,  in  discussing  the  third  equation  above, 
we  may  write  y* ,  t/**,  y*,  z*,  **i  z* *>  z**  to  stress  that  yh  yif  f  i,  z%  do 
appear  in  the  third  equation  whereas  the  other  variables  2/1,  z%,  zK 
do  not. 

Finally,  AJ*  means  the  matrix  that  can  be  formed  from  the  elements 
of  A  by  taking  only  the  columns  of  A  that  correspond  to  the  variables 
x**  absent  from  the  gth.  equation.     For  example, 


**  _ 


A? 


The  columns  of  A**  correspond  to  x**  =  vec  (2/2,^3). 


"0 

0" 

1 

023 

.0 

1    . 

AJ* 


corresponding  to  x**  «  vec  (y^z^zi). 


0 

713 

714 

1 

723 

0 

0 

0 

0 
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Assumptions,  additivity,  error  term,  5, 
18,22 

Simplifying,  9-17 

statistical,  4 

stochastic,  4,  6,  95 

(See  also  Error  term) 
Autocorrelation  (see  Serial  correlation) 
Autorcgression,  52-53,  61,  168,  195 

of  the  economy,  over-all,  185-186, 
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Cobweb  phenomena,  16 

Cochrane,  D.,  195 

Completeness,  86,  106,  213 
(See  also  Nonsingularity) 

Computations,  32,  197-202 

Conflicting  observations,  109 

Confluence,  103-106,  155 
linear,  142-144 

Consistency,  24,  36,  43-46,  51,  118 
in  instrumental  variables,  114 
in  least  squares,  44,  47-49,  195 
in  limited  information,  118 
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Consistency,  notation  for,  44 

in  reduced  form,  100,  103 

in  Thcil's  method,  118,  128 
Consumption  function,  cross  section 
versus  time  series,  192-196 

examples,  2,  63-72,  137-139 

secular,  bias  in,  71 
Correcting  denominators,  180-181 
Correlation,  141-142 

partial,  144-145 

serial,  168-176,  195 

spurious,  150 
Correlation  coefficient,  matrix,  145 

partial,  144-145 

Bamplc,  141-142,  155 
computation  of,  200-201 

universe,  141-142,  155 
Correlograms,  174-175,  195 
Cost  function,  21-22 
Counting  rules  for  identification,  92-96, 
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Co  variance,  19,  27,  82-83,  170,  207-208 

error  term,  30,  49 

population  and  sample,  141 
Cramer's  rule,  35,  203 
Credence,  2^,  4i-4L»,  110 
Cross-section  versus  time-series  esti- 
mates, 192-196 

(See  also  Consumption  function) 
Cyclical  form,  121 

(See  also  Recursive  models) 
Cyclical  indicators,  182 

(See  also  Business  cycle  model) 
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Decay,  models  of,  initial  conditions  in, 
53,  68-61 
least  squares  bias  in,  52-62,  171, 

209-212 
unbiased  estimation  in,  60-61 
Degrees  of  freedom,  3 
Determinants,  34-35 
Cramer's  rule,  35 
Jacobians,  29,  73-74,  79-82 
Diffusion  index,  182,  188-190,  196 
and  statistical  coverage,  191-192 
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Discontinuity,  of  hypotheses,  137-138 

probability,  9-10 
Disturbances  (see  Error  term) 
Dummy  variables,  140,  156-157,  165 


Efficiency,  24,  47,  51,  118 
in  instrumental  variables,  118 
in  least  squares,  47-49 

and  heteroskedasticity,  48 
in  limited  information,  118 
and  maximum  likelihood,  47-49 
in  Theirs  method,  118 
Eigenvector,  124 
Elasticities,  price,  and  Haavelmo's 

proposition,  72 
Equations,  autorcgrcssivc,  52-53 
simultaneous,  67-71,  73-84 

notation,  74-76 
structural,  bogus,  89-90 
Error  term,  4,  8,  9-18 

additivity  assumption,  5,  18,  22 
covariance  matrix,  30,  49 
Simplifying  Assumptions,  constancy 
of  variance  (no.  3),  13,  17,  77 
(See  also  Heteroskedasticity) 
normally  distributed  (no.  4),  14,  17, 

77 
random  real  variable  (no.  1),  9,  17, 

77 
serial  independence  of  (no.  5),  16, 

18,77 
uncorrelated,  in  multi-equation 
models  (no.  7),  78-79,  87, 
213 
with  predetermined  variable  (no. 
6),  16,  18,  33,  52,  53,  65n. 
zero  expected  value  of  (no.  2),  10, 
17,  77 
Errors,   of  econometric  relationships, 
64-76 
of  measurement,  6,  48 
in  variables,  155 
Estimate,  variance  of,  39-43 
Estimates  in  multi-equation  models,  in- 
terdependence of,  82 
Estimating  criteria,  6,  8,  23,  47 
(See  also  Consistency;  Maximum 
likelihood;  Unbiascdncss) 
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Estimation,  1-2,  8,  108-110 
simultaneous,  67-68,  126-135 
unbiased,  in  models  of  decay,  60-61 

Estimators,  extraneous,  134 
subsample  variances  as,  207-208 

Expectation,  18-20 

Expected  value,  10-11,  19-21 


Factor  analysis,  linear  orthogonal,  160- 
164 

unspecified,  156-165 

versus  variance  analysis,  164-165 
Fisher,  R.  A.,  51 
Forecasts,  criteria  for,  132-134 

leading  indicators  as,  186-188 
Fox,  K.  A.,  21,  134 
Friedman,  M.,  72,  134 
Frisch,  R.,  155 
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Gini,  C.,  51 
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Haavelmo,  T.,  71,  106,  155,  196 
Haavelmo  proposition,  64-66,  71-72, 
125 

Heteroskedasticity,  48-49 
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(See  also  Error  term,  Simplifying 
Assumptions,  no.  3) 
Hogben,  L.,  12n.,  46,  51 
Homoskcdastic  equation,  50 

(Sec  also  Heteroskedasticity) 
Hood,  W.  C,  xvii,  21,  85,  125 
Hurwicz,  L.,  22,  53,  62 
Hypotheses,  choice  of,  154-155 

discontinuous,  K>7-138 

maintained,  130-137 

null,  138 

questioned,  137 

testing  of,  136-155 


Identification,  3-4,  64,  85-106 
counting  rules,  92-96,  102 
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Identification,  exact,  88,  9£Htt,  94,  107 
absence  of,  91,  128-131 
of  parameters  in  underidcfUified 

equation,  96-97,  128 
(See  also  Overidentification ;  Under- 
identification) 
Improvement,  44 
Incomplete  theory,  5 
Independence  of  simultaneous  equa- 
tions, stochastic,  78-79,  8C-87,  213 
Initial  conditions  in  models  of  decay,  53, 

58-61 
Instrumental  variable  technique,  107- 
117 
properties,  114 

related,  to  limited  information,  118, 
125 
to  reduced  form,  113 
Instrumental  variables,  efficiency  in, 
118 
weighted,  116-117 
Interdependence,  simultaneous,  63-72 


Jacobians,  digression  on,  79-82 

of  likelihood  function,  29 

references,  84 

and  simultaneous  equations,  73-74 
Jaffd,  W.,  x 
Jeffreys,  H.,  51 


Kaplan,  W.,  84 

Kendall,  M.  G.,  xvii,  21,  50,  51,  165, 

173n.,  195 
Keynes,  J.  M.,  51 
Klein,  L.  R.,  xvii,  21,  51,  84, 106, 124n., 

125,  134,  155,  201 
Koizumi,  S.,  21 
Koopmans,  T.  C.,  xvii,  22,  72,  84,  85, 

94n.,  106,  155,  195 
Kuh,  E.,134 


Lacunes  (missing  data),  159n. 
Lagged  model,  53-54,  62 

(See  also  Recursive  models) 
Lango,  G.,  22 
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Leading  indicators,  186-188 
Least  squares,  23-50 

bias,  in  models  of  decay,  52-62,  209- 

212 
consistency,  44,  47-49,  195 
diagonal,  and  simultaneous  estima- 
tion, 07-68 
directional,  69,  125 
efficiency,  47-49 

and  hotoroskodastioity,  48 
generalized,  33-35 
Haavclmo  bias,  65-67 
justification,  31-32,  134-135 
maximum  likelihood,  24,  31-34,  47- 

49,  57 
naive,  compared  to  maximum  likeli- 
hood, 67,  82-83 
reduced  form,  88,  98,  113,   127, 

133 
references,  50-51,  134-135 
related  to  instrumental  variables, 

113-114 
relation  to  limited  information,  125 
simultaneous  estimation,  67-70 
stepwise,  203-206 
sufficiency,  47-49 
unbiasedness,  35-39,  47-49,  60,  65- 

67 
used  in  estimating  reduced  form, 
88 
Loser,  C.  E.  V.,  157 
Likelihood,  24-25,  50 
Likelihood  function,  23,  28-31 
and  identification,  90-91,  100-102 
(See  also  Maximum  likelihood) 
Limited  information,  4,  118-125,  128 
consistency  in,  118 
efficiency  in,  118 
formulas  for,  123 
relation  of,  to  indirect  least  squares, 

125 
to  instrumental  variables,  118,  125 
Linear  confluence,  142-144 
Linear  models,  multi-equation,  nota- 
tion, 74-77 
versus  ratio  models,  152-153 
Linearity,  testing  for,  150-152 
Lh,  T.  C,  130-132 


Marginal  propensity  to  consume  (see 

Consumption  function) 
MarkofT  theorem,  195 
Marschak,  J.,  21 
Matrix,  3,  27 

coefficients,  75 

inversion,  201-202 

moments,  31,  124,  198-201 

nonsinguln/ity  of,  86-87 

orthogonality,  100-164 

rank  and  identification,  93-94 

triangular  (recursive  models),  83 
Maximum  likelihood,  23,  25,  29-33,  50, 
78-84 

computation,  83 

consistency,  44-48 

efficiency,  47-49 

full  information,  73-84,  118,  129,  133 

identification,  100-103 

interdependence,  82-83 
simultaneous,  63-71 

limited  information,  118-125 

subsample  variances,  207 

value  for  forecasting,  132-133 
Mean,  arithmetic,  expected  value,  10-11 
Measurement,  errors  of,  6,  48 
Meyer,  J.  It.,  134 
Miller,  II.  L.,  Jr.,  134 
Mitchell,  W.  C,  195 
Model,  1 

business  cycle,  171-174,  183-184 

lagged,  53-54,  62 

latent,  110-112 

manifest,  110-112 

recursive,  83-84 

supply-demand,  89 

(See  also  Linear  models) 
Model  specification  (see  Specification) 
Moments,  3,  20,  32-35 

algebra  of,  51 

computation  of,  32-33,  197-200 

determinants  of,  34-35 

expectation  of,  20 

matrices  of,  34,  124 

raw,  198 

from  samplo  means,  18-20 

simple  (augmented),  20,  198 
Moore,  Y.  II.,  134,  187n.,  196 
Moving  average  technique,  61,  168 
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Moving  average  technique,  Slutsky 

proposition,  173-174 
Multicollinearity,  3,  103-106 
etymology,  105-106 
(See  also  Confluence;  Undcridentifi- 
cation) 


National  Bureau  of  Economic  Research, 

181-182,  186-188,  195 
Nature's  Urn,  11-13,  168 

subjective  belief  and,  11-12 
Ncyman,  J.,  195 
Nonsingularity,  86-87 

(See  also  Completeness) 
Normal  distribution,  multivariate,  26- 
27 

univariate,  14—15 


Observations,  7 

conflicting,  109 

(See  also  Sample) 
Orcutt,  G.  H.,  72,  185,  195 
Original  form,  87-88 

(See  also  Reduced  form) 
Orthogonality,  160-164 

test  of,  163-164 

and  variance  analysis,  165 
Oscillations,  53,  172-174 
Overdetcrminacy,  88-89 

(See  also  Ovcridentification) 
Ovcridentification,  86,  89,  98-103,  129 

contrasted  with  underidentification, 
100-103 
Overidentified  models,  ambiguity  in,  98 


P  lim  (probability  limit),  44 
Parameter  estimates,  constraints  on, 

94-95 
Parameter  space  and  identification, 

100-102 
Parameters,  2,  8 

a  priori  constraints  on,  91-94 
Pareto  index,  194 
Pearson,  K.,  51 
Periodograms,  175-176,  195 
Population,  7,  11-12,  141 
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Prediction,  2 

(See  also  Forecasts;  Leading  indica- 
tors) 
Probability,  3,  9-10,  24 

density,  9 

inverse,  and  maximum  likelihood,  51 
Probability  limit  (P  lim),  44 
Production  function,  21-22 

Cobb-Douglas,  157-159 


Quandt,  R.  E.,  x 


Random  term  (see  Error  term) 

Random  variable,  9 

Rank,  93-94 

Ratio  models  versus  linear  models,  152- 

153 
Recursive  models,  83-84,  121 
Reduced  form,  87-88 

bias,  100-103 

dual,  Theil'a  method,  126-128 

and  instrumental  variable  technique, 
113 

least-squares  forecasts,  133 
Residual  contrasted  with  error,  8 


Sample,  6-7,  11-12,  141-142 

and  consistency.  43-46 

size,  53 

and  unbiasedness,  35-43 

variance,  39-43 
Samples,  conjugate,  52,  54-57 
Sampling  procedure,  6-21 
Scaling  of  variables,  198 
Schultz,  H.,  22,  9Qn. 
Seasonal  variation,  176-178 

removal  of,  182 
Sectors,  independence  and  nonsingular- 
ity, 87 

split,  versus  sector  variable,  140, 153- 
154 
Serial  correlation,  168-171,  175,  195 

in  Simplifying  Assumptions,  16,  18, 
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